Sampling Theory and Surveys
GV917
Introduction to Sampling






In statistics the population refers to the total
universe of objects being studied. Examples
include:
All voters in the UK
All graduate students at the University of Essex
These are finite populations, but we also meet
infinite populations such as:
All possible rolls of a six-sided die
All possible turns of a roulette wheel
The Purpose of Sampling



We take samples in order to:
Estimate population characteristics or
parameters – e.g. the average age of all
voters in the UK
Test hypotheses about a population, e.g. did
60 per cent of women turn out to vote in a
general election?
Hypothetical Population




Suppose we have a population consisting of
five numbers: 3, 5, 7, 9, 11
The sum of this population is 35 and the
mean is 7 (denoted µ)
Now suppose we are trying to infer this
population mean from a single random
sample of size 2.
How likely is it that we will infer the population
mean correctly?
Samples of Size Two From a
Population of Size Five








Sample
Number:   1   2   3   4   5   6   7   8   9  10
_______________________________________________
          3   3   3   3   5   5   5   7   7   9
          5   7   9  11   7   9  11   9  11  11
_______________________________________________
Sum       8  10  12  14  12  14  16  16  18  20
Mean      4   5   6   7   6   7   8   8   9  10
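The table of samples can be reproduced mechanically. The following is a minimal sketch, assuming the samples are unordered and drawn without replacement (which is what the ten columns imply); the variable names are illustrative only.

```python
from itertools import combinations

population = [3, 5, 7, 9, 11]

# Every unordered sample of size 2, drawn without replacement.
samples = list(combinations(population, 2))
sample_means = [sum(s) / len(s) for s in samples]

# One row per sample: number, members, sum, mean.
for number, (sample, mean) in enumerate(zip(samples, sample_means), start=1):
    print(number, sample, sum(sample), mean)
```

Only two of the ten samples have a mean equal to the population mean of 7, which is where the probability of 0.20 for the point estimate comes from.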
The Sampling Distribution – The likelihood
of different samples occurring











Sample     Probability   Sample*Probability
Mean (X)   p(X)          p(X).X
   4       0.10          0.40
   5       0.10          0.50
   6       0.20          1.20
   7       0.20          1.40
   8       0.20          1.60
   9       0.10          0.90
  10       0.10          1.00
                         -----
Mean of the Means E(X) = Σ p(X).X = 7.00 (the expected value)
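As a check on the sampling distribution, the probabilities and the expected value can be computed directly from the ten equally likely samples. This sketch uses exact fractions to avoid rounding; the names `distribution` and `expected_value` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

population = [3, 5, 7, 9, 11]
means = [Fraction(sum(s), 2) for s in combinations(population, 2)]

# p(X): share of the ten equally likely samples that give each mean.
counts = Counter(means)
distribution = {m: Fraction(c, len(means)) for m, c in sorted(counts.items())}

# E(X) = sum of p(X) * X over all possible sample means.
expected_value = sum(m * p for m, p in distribution.items())

for m, p in distribution.items():
    print(int(m), float(p), float(m * p))
print("E(X) =", float(expected_value))
```

The expected value of the sample mean equals the population mean, which is why the sample mean is an unbiased estimator here.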
A Simple Confidence Interval Estimate of
the Population Mean






Point Estimate
µ = X̄ (probability of being correct = 0.20)
Interval Estimate
µ = X̄ ± 1.0 (probability of being correct = 0.60)
µ = X̄ ± 2.0 (probability of being correct = 0.80)
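The probabilities attached to these estimates can be verified by counting how many of the ten samples produce an interval that contains the population mean. A minimal sketch, with `coverage` as a hypothetical helper name:

```python
from fractions import Fraction
from itertools import combinations

population = [3, 5, 7, 9, 11]
mu = Fraction(sum(population), len(population))
means = [Fraction(sum(s), 2) for s in combinations(population, 2)]

def coverage(half_width):
    """Share of samples whose interval (mean ± half_width) contains mu."""
    hits = sum(1 for m in means if abs(m - mu) <= half_width)
    return Fraction(hits, len(means))

for half_width in (0, 1, 2):
    print(half_width, float(coverage(half_width)))
```

Widening the interval trades precision for a higher chance of being correct.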
The Standard Deviation of the Sampling
Distribution













Sample     Probability   Deviations    Deviations     Deviations Squared
Mean (X)   p(X)          (X - E(X))    Squared        * Probability
                                       (X - E(X))²    p(X).(X - E(X))²
   4       0.10          -3            9              0.90
   5       0.10          -2            4              0.40
   6       0.20          -1            1              0.20
   7       0.20           0            0              0.00
   8       0.20          +1            1              0.20
   9       0.10          +2            4              0.40
  10       0.10          +3            9              0.90
                                                      -----
                                                    Σ 3.00
Standard Error (σx) = √[Σ p(X).(X - E(X))²] (a measure of the typical error)
                    = √3.0 = 1.73
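The same calculation can be sketched in a few lines. Because each of the ten samples is equally likely, weighting by p(X) is equivalent to averaging the squared deviations over the ten samples, which is what this illustrative snippet does.

```python
import math
from itertools import combinations

population = [3, 5, 7, 9, 11]
means = [sum(s) / 2 for s in combinations(population, 2)]

expected = sum(means) / len(means)               # E(X)
squared_devs = [(m - expected) ** 2 for m in means]
variance = sum(squared_devs) / len(means)        # each sample has p = 1/10
standard_error = math.sqrt(variance)

print(variance, round(standard_error, 2))
```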
Using the Standard Error in a Confidence
Interval










µ = X̄ ± standard error
µ = X̄ ± 1.73 (probability of being correct = 0.60)
A multiple of the Standard Error
µ = X̄ ± 1.73 * 1.5
µ = X̄ ± 2.6 (probability of being correct = 0.80)
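To see how multiples of the standard error translate into probabilities for this small population, one can count the samples whose interval contains the population mean. This is a sketch under the same assumptions as before (ten equally likely samples); `coverage` is a hypothetical helper.

```python
import math
from itertools import combinations

population = [3, 5, 7, 9, 11]
mu = sum(population) / len(population)
means = [sum(s) / 2 for s in combinations(population, 2)]

# Standard error: root of the mean squared deviation of the sample means.
standard_error = math.sqrt(sum((m - mu) ** 2 for m in means) / len(means))

def coverage(multiple):
    """Share of the ten samples whose interval contains mu."""
    half_width = multiple * standard_error
    return sum(abs(m - mu) <= half_width for m in means) / len(means)

for multiple in (1.0, 1.5, 2.0):
    print(multiple, round(multiple * standard_error, 2), coverage(multiple))
```

With only ten possible samples the probabilities move in steps; with large samples the normal distribution gives a smooth relationship instead.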
The Sampling Distribution with Large Samples –
The Normal Distribution
Confidence intervals with the Normal
Distribution




µ = X̄ ± σx (probability of being correct = 0.68)
µ = X̄ ± 1.96*σx (probability of being correct = 0.95)
µ = X̄ ± 2.58*σx (probability of being correct = 0.99)
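The three probabilities are areas under the standard normal curve. As a small illustration, they can be recovered from the error function in Python's standard library; `normal_coverage` is a hypothetical helper name.

```python
import math

def normal_coverage(z):
    """P(-z <= Z <= z) for a standard normal variable Z."""
    return math.erf(z / math.sqrt(2))

for z in (1.0, 1.96, 2.58):
    print(z, round(normal_coverage(z), 4))
```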
But how can we know the standard error
with only one sample?

In practical applications we cannot calculate the sampling
distribution directly because the number of possible samples of
size, say, 1,000 that can be taken from a population of 60 million
(the approximate size of the UK population) is astronomically large.

A powerful theorem in statistics called the Central Limit Theorem
enables us to estimate the standard error from one sample only.

The intuition behind this is that a large enough sample provides
a measure of the variability of all samples taken from a given
population, provided that any sample can be chosen.

Thus if a random sample is very variable, then different random
samples taken from that population are going to be quite variable
too. If a random sample is not very variable, this suggests that
samples taken from the population will not vary much either.
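This intuition can be illustrated with a small simulation. The sketch below is not part of the original slides: it assumes a hypothetical population of 10,000 scores on a 0 to 10 scale, draws many samples to measure how the sample means actually vary, and compares that with the estimate s/√n obtained from one sample alone.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 scores on a 0-10 scale.
population = [rng.randint(0, 10) for _ in range(10_000)]
n = 100

# What we normally cannot do: draw many samples and see how their means vary.
many_means = [statistics.mean(rng.sample(population, n)) for _ in range(2_000)]
empirical_se = statistics.stdev(many_means)

# What we can do: estimate the same quantity from a single sample.
one_sample = rng.sample(population, n)
estimated_se = statistics.stdev(one_sample) / math.sqrt(n)

print(round(empirical_se, 3), round(estimated_se, 3))
```

The two printed figures should be close to one another, although the single-sample estimate will vary somewhat from one sample to the next.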
Calculating the Standard Error
The theorem shows that:

σx = s/√n



Where σx is the standard error of the mean
s is the sample standard deviation
n is the sample size
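The formula is straightforward to apply. Here is a minimal sketch using Python's standard library; the function name `standard_error` and the five data values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
import statistics

def standard_error(values):
    """Standard error of the mean: sample standard deviation over root n."""
    s = statistics.stdev(values)   # sample standard deviation (n - 1 divisor)
    return s / math.sqrt(len(values))

# Hypothetical sample of five observations.
sample = [1, 3, 5, 7, 9]
print(round(standard_error(sample), 3))
```

For this sample the variance is 10 and n is 5, so the standard error is √(10/5) = √2.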
A Confidence Interval from the 2005 BES

Descriptive Statistics
                                                  N   Minimum   Maximum     Mean   Std. Deviation
aq16a Feelings About Labour Party              3517       .00     10.00   5.0446          2.60174
aq16b Feelings About Conservative Party        3470       .00     10.00   4.4026          2.42338
aq16c Feelings About Liberal Democrat Party    3396       .00     10.00   4.7400          2.04121
Valid N (listwise)                             3390

Feelings about Labour

µ = X̄ ± 1.96*σx (probability of being correct = 0.95)

µ = 5.0446 ± 1.96 * (2.6017/√3517)

µ = 5.0446 ± 0.086

µ = 4.9586 to 5.1306 (probability of being correct = 0.95)
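The Labour interval can be reproduced from the three figures in the descriptive statistics (N, mean and standard deviation). This is a sketch of the arithmetic only, and it shares the assumption that the BES can be treated as a simple random sample.

```python
import math

# Feelings about Labour, from the 2005 BES descriptive statistics.
n = 3517
mean = 5.0446
sd = 2.60174

standard_error = sd / math.sqrt(n)
margin = 1.96 * standard_error          # half-width of the 95% interval
lower, upper = mean - margin, mean + margin

print(round(margin, 3), round(lower, 3), round(upper, 3))
```

Substituting the N, mean and standard deviation for the Conservative or Liberal Democrat items gives the corresponding intervals for those parties.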
Complications



The calculation assumed that the BES is a simple random sample of
the UK voting population, that is, every adult in the country has an
equal chance of being selected for the sample.
But if we used a simple random sample, respondents would be
evenly spread across the country, involving a lot of travel time and
costs for the interviewers. Costs can be reduced by ‘clustering’ the
sample, that is, choosing people who live relatively close together.
This is done by sampling in stages: first constituencies, then wards
and finally individuals.
The accuracy of the sample can be improved by stratifying it, that is,
ensuring that groups appear in the sample in exactly the same
proportions as they appear in the population. In the 2005 general
election 26.6 per cent of the seats had majorities of less than 10 per
cent; these were the marginal seats that decided the election. In a
new sample there is an advantage in making sure that exactly 26.6 per
cent of the constituencies are marginal seats. A simple random sample
would not necessarily deliver this; it might deliver 25 per cent by
chance. So we improve accuracy by replicating the known
characteristics of the population. This is called stratifying by
marginality.
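Stratifying by marginality can be sketched as sampling separately within each stratum, with each stratum's sample size proportional to its share of the population. The example below is purely illustrative: the frame of 1,000 constituencies, the sample size of 500 and the function name are assumptions, not the BES design.

```python
import random

def stratified_sample(strata, total_n, rng):
    """Proportional allocation: each stratum gets round(share * total_n) units.

    Rounding means the stratum sizes may not add up to total_n in general.
    """
    population_size = sum(len(units) for units in strata.values())
    chosen = {}
    for name, units in strata.items():
        n_stratum = round(total_n * len(units) / population_size)
        chosen[name] = rng.sample(units, n_stratum)
    return chosen

# Hypothetical frame: 1,000 constituencies, 266 of them marginal (26.6%).
frame = {
    "marginal": [f"M{i}" for i in range(266)],
    "safe": [f"S{i}" for i in range(734)],
}
sample = stratified_sample(frame, total_n=500, rng=random.Random(0))

share = len(sample["marginal"]) / sum(len(v) for v in sample.values())
print(len(sample["marginal"]), len(sample["safe"]), round(100 * share, 1))
```

Unlike a simple random sample, the marginal share here is fixed by design rather than left to chance.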
Sampling in Practice – the BES 2005



We might want to over-sample some groups because they have
interesting political characteristics and a simple random sample
would provide too few cases for analysis. This was done in
Scotland in 2005. Scots make up about 9 per cent of the British
population, but just over 25 per cent of the BES sample in 2005
came from Scotland, because we wanted enough cases to analyse
Scottish politics, which is rather different from politics in England.
Of course any analysis of voters in Britain as a whole has to weight
the sample to make sure that the Scots are represented accurately.
The survey was designed to yield a representative sample of adults
aged 18 or above living in private households in Britain (excluding
the area north of the Caledonian Canal). Adults living in Northern
Ireland were excluded from the study. The sample was drawn from
the Postcode Address File, a list of addresses (or postal delivery
points) compiled by the Post Office. For practical reasons, samples
are confined to those living in private households. People living in
institutions (though not in private households at such institutions) are
excluded, as are households whose addresses are not on the
Postcode Address File.
The sampling method involved a clustered multi-stage design, with
three separate stages of selection.
Sampling in Practice – The BES in 2005



In the first instance, 128 constituencies were sampled at random: 77
in England, 29 in Scotland and 22 in Wales, using stratification on
marginality of election results, geographic regions and population
density. (In Wales, the percentage of Welsh speakers was used instead
of geographic region.) Scottish and Welsh constituencies were
over-sampled to achieve Scottish and Welsh boost samples. In England,
marginal constituencies were slightly over-sampled.
Within each constituency, two wards were sampled at random,
giving 256 sample points.
At each sample point (ward), addresses were selected with equal
probability across the sample point. More addresses were selected
in Scottish and Welsh sample points than in English ones (27
compared with 24) – again, in order to achieve Scottish and Welsh
boost samples. Using random methods, the interviewer then
selected one person for interview at each address.
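The three stages imply a simple piece of arithmetic. The sketch below assumes that exactly 24 addresses were selected at every English sample point and exactly 27 at every Scottish and Welsh one; the text gives these as the figures being compared, but does not say they were identical at every point.

```python
constituencies = {"England": 77, "Scotland": 29, "Wales": 22}
wards_per_constituency = 2
addresses_per_point = {"England": 24, "Scotland": 27, "Wales": 27}  # assumed uniform

# Stage 2: two wards (sample points) per sampled constituency.
sample_points = {c: n * wards_per_constituency for c, n in constituencies.items()}

# Stage 3: addresses selected at each sample point.
addresses = {c: sample_points[c] * addresses_per_point[c] for c in constituencies}

total_points = sum(sample_points.values())
total_addresses = sum(addresses.values())
print(total_points, addresses, total_addresses)
```

Under that assumption the total comes to 6,450 addresses, which agrees with the number of addresses issued in the pre-election response-rate figures.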
Sample Precision


The sample precision is measured by the size of
the standard errors. If we stratify the sample this
increases precision (reduces the size of the
standard error). If we cluster, this decreases it.
Non-response can also reduce accuracy if the
non-respondents differ from the respondents, which
they generally do. They tend to be less
interested in politics and less likely to vote, so
we need to weight the sample to correct for this
source of bias.
Response Rates in the 2005 BES pre-election
survey
                                                           N        %
Addresses issued                                       6,450
Out of scope (e.g. derelict building)                    515
Eligible                                               5,935   100.0%
Interview achieved                                     3,589    60.5%
Interview not achieved because:                        2,346    39.5%
  Refused                                              1,679    28.3%
  Not contacted (e.g. someone who moved
    without a forwarding address)                        382     6.4%
  Other unproductive (e.g. too ill to talk
    to interviewers)                                     285     4.8%
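The percentages are all calculated on the eligible addresses rather than on the addresses issued. A short sketch of that arithmetic, using the counts reported for the survey:

```python
issued = 6450
out_of_scope = 515
eligible = issued - out_of_scope   # the base for every percentage

outcomes = {
    "Interview achieved": 3589,
    "Refused": 1679,
    "Not contacted": 382,
    "Other unproductive": 285,
}

rates = {name: round(100 * count / eligible, 1) for name, count in outcomes.items()}
not_achieved = eligible - outcomes["Interview achieved"]

print(eligible, not_achieved, rates)
```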
Weighting in the BES

The Scots are over-represented in the sample, so if we want to
analyse Britain as a whole they have to be weighted down.
On average, if each Scot in the sample counts as only 0.3404
of a person, this corrects their over-representation.

Thus 0.3404*933 = 318 Scots in the weighted sample, which is 8.8
per cent of the total of 3589. This is the correct proportion of Scots
in Britain.
prewtbr Weight for GB [calibrated]
acountry Country      Mean       N   Std. Deviation
1 England           1.5339    2014           .86852
2 Scotland           .3404     933           .15177
3 Wales              .2836     642           .13697
Total               1.0000    3589           .89304
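The effect of the weights on the make-up of the sample can be worked out from the mean weight and the number of respondents in each country. This is a rough sketch using the mean weights only; individual respondents have their own weights, so the real calculation sums those instead.

```python
# (mean weight, unweighted N) per country, from the weight table.
groups = {
    "England": (1.5339, 2014),
    "Scotland": (0.3404, 933),
    "Wales": (0.2836, 642),
}

unweighted_total = sum(n for _, n in groups.values())
weighted_n = {country: w * n for country, (w, n) in groups.items()}
weighted_total = sum(weighted_n.values())

for country, (w, n) in groups.items():
    before = 100 * n / unweighted_total
    after = 100 * weighted_n[country] / weighted_total
    print(country, round(before, 1), round(after, 1))
```

The weights leave the overall sample size essentially unchanged while shrinking the Scottish share from about a quarter to under a tenth.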
Unweighted Party Voting in 2010
bq12_2 Party Vote 2010 General Election
                                                        Frequency  Percent  Valid Percent  Cumulative Percent
Valid    -2.00 Refused                                        101      2.9            4.2                 4.2
         -1.00 Don't Know                                      11       .3             .5                 4.7
         1.00 Labour                                          731     20.8           30.6                35.2
         2.00 Conservatives                                   815     23.2           34.1                69.3
         3.00 Liberal Democrats                               500     14.2           20.9                90.2
         4.00 Scottish National Party (SNP)                   105      3.0            4.4                94.6
         5.00 Plaid Cymru                                      16       .5             .7                95.3
         6.00 Green Party                                      24       .7            1.0                96.3
         7.00 United Kingdom Independence Party (UKIP)         46      1.3            1.9                98.2
         8.00 British National Party (BNP)                     32       .9            1.3                99.5
         9.00 Other                                            11       .3             .5               100.0
         Total                                               2392     68.1          100.0
Missing  System                                              1120     31.9
Total                                                        3512    100.0
Weighted Party Voting in 2010
(weighted for post-election analysis)
bq12_2 Party Vote 2010 General Election
                                                        Frequency  Percent  Valid Percent  Cumulative Percent
Valid    -2.00 Refused                                         86      2.8            3.7                 3.7
         -1.00 Don't Know                                      11       .4             .5                 4.1
         1.00 Labour                                          725     23.6           30.7                34.9
         2.00 Conservatives                                   849     27.6           36.0                70.9
         3.00 Liberal Democrats                               531     17.3           22.5                93.4
         4.00 Scottish National Party (SNP)                    38      1.2            1.6                95.0
         5.00 Plaid Cymru                                      10       .3             .4                95.5
         6.00 Green Party                                      23       .8            1.0                96.4
         7.00 United Kingdom Independence Party (UKIP)         45      1.5            1.9                98.3
         8.00 British National Party (BNP)                     30      1.0            1.3                99.6
         9.00 Other                                             9       .3             .4               100.0
         Total                                               2357     76.7          100.0
Missing  System                                               718     23.3
Total                                                        3075    100.0
The Effects of Weighting






The Actual Party Vote Shares in 2010 were:
Labour 29.0%
Conservatives 36.1%
Liberal Democrats 23.0%
Others 11.9%
The weighted Conservative and Liberal Democrat
vote shares (36.0 and 22.5 per cent) are clearly
closer to the actual result than the unweighted
ones (34.1 and 20.9 per cent)
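The comparison can be made explicit by computing the gap between each survey estimate and the actual result. The sketch below uses the valid percentages from the two voting tables; note that these still include the "Refused" and "Don't Know" categories in their base, so they are not strictly like-for-like with the official vote shares.

```python
actual = {"Labour": 29.0, "Conservatives": 36.1, "Liberal Democrats": 23.0}
unweighted = {"Labour": 30.6, "Conservatives": 34.1, "Liberal Democrats": 20.9}
weighted = {"Labour": 30.7, "Conservatives": 36.0, "Liberal Democrats": 22.5}

def absolute_errors(estimate):
    """Absolute gap, in percentage points, between estimate and actual share."""
    return {party: round(abs(estimate[party] - actual[party]), 1) for party in actual}

errors_unweighted = absolute_errors(unweighted)
errors_weighted = absolute_errors(weighted)

for party in actual:
    print(party, errors_unweighted[party], errors_weighted[party])
```

On these figures weighting reduces the error substantially for the Conservatives and Liberal Democrats, while the Labour estimate is essentially unchanged.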
Conclusions



Statistical theory helps us to make inferences
about populations from much smaller samples
Inferences are possible because everyone in the
population has a known (small) chance of ending up
in the sample, which is what makes the sample
representative
In practice the calculation of sampling errors is
complicated by various sample design factors
aimed at making surveys less costly