Survey
Sampling Theory and Surveys
GV917

Introduction to Sampling

In statistics the population refers to the total universe of objects being studied. Examples include:
- All voters in the UK
- All graduate students at the University of Essex

These are finite populations, but we also meet infinite populations such as:
- All possible rolls of a six-sided die
- All possible turns of a roulette wheel

The Purpose of Sampling

We take samples in order to:
- Estimate population characteristics or parameters, e.g. the average age of all voters in the UK
- Test hypotheses about a population, e.g. did 60 per cent of women turn out to vote in a general election?

Hypothetical Population

Suppose we have a population consisting of five numbers: 3, 5, 7, 9, 11. The sum of this population is 35 and the mean is 7 (denoted µ). Now suppose we are trying to infer this population mean from a single random sample of size 2. How likely is it that we will infer the population mean correctly?

Samples of Size Two From a Population of Size Five

Sample number    1   2   3   4   5   6   7   8   9  10
------------------------------------------------------
First value      3   3   3   3   5   5   5   7   7   9
Second value     5   7   9  11   7   9  11   9  11  11
------------------------------------------------------
Sum              8  10  12  14  12  14  16  16  18  20
Mean             4   5   6   7   6   7   8   8   9  10

The Sampling Distribution: The Likelihood of Different Samples Occurring

Sample mean   Probability   Sample mean * probability
X̄             p(X̄)          X̄ * p(X̄)
 4            0.10          0.40
 5            0.10          0.50
 6            0.20          1.20
 7            0.20          1.40
 8            0.20          1.60
 9            0.10          0.90
10            0.10          1.00
                            ----
                        Σ = 7.00

Mean of the means: E(X̄) = Σ X̄ * p(X̄) = 7.00 (the expected value)

A Simple Confidence Interval Estimate of the Population Mean

Point estimate:      µ = X̄          (probability of being correct = 0.20)
Interval estimate:   µ = X̄ ± 1.0    (probability of being correct = 0.60)
                     µ = X̄ ± 2.0    (probability of being correct = 0.80)

The Standard Deviation of the Sampling Distribution

Sample mean   Probability   Deviation    Squared deviation   Squared deviation * probability
X̄             p(X̄)          X̄ - E(X̄)     (X̄ - E(X̄))²         p(X̄) * (X̄ - E(X̄))²
 4            0.10          -3           9                   0.90
 5            0.10          -2           4                   0.40
 6            0.20          -1           1                   0.20
 7            0.20           0           0                   0.00
 8            0.20          +1           1                   0.20
 9            0.10          +2           4                   0.40
10            0.10          +3           9                   0.90
                                                             ----
                                                         Σ = 3.00

Standard error (σx) = √[Σ p(X̄) * (X̄ - E(X̄))²] = √3.00 = 1.73 (the average error)

Using the Standard Error in a Confidence Interval

µ = X̄ ± standard error
µ = X̄ ± 1.73    (probability of being correct = 0.60)

A multiple of the standard error:

µ = X̄ ± 1.73 * 1.5
µ = X̄ ± 2.6     (probability of being correct = 0.80)

The Sampling Distribution with Large Samples: The Normal Distribution

With large samples, the sampling distribution of the sample mean is approximately normal.

Confidence Intervals with the Normal Distribution

µ = X̄ ± σx           [probability of being correct of 0.68]
µ = X̄ ± 1.96 * σx    [probability of being correct of 0.95]
µ = X̄ ± 2.58 * σx    [probability of being correct of 0.99]

But how can we know the standard error with only one sample?

In practical applications we cannot calculate the sampling distribution directly, because there are far too many possible samples of size, say, 1,000 that can be taken from a population of 60 million (the approximate size of the UK population).

A powerful theorem in statistics called the Central Limit Theorem enables us to infer the standard error from one sample only.

The intuition behind this is that a large enough sample is going to provide a measure of the variability of all samples taken from a given population, provided that any sample can be chosen.

Thus if a random sample is very variable, then different random samples taken from that population are going to be quite variable too.
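The sampling distribution of the five-number population introduced earlier is small enough to enumerate by machine. The following is a minimal sketch, assuming samples are drawn without replacement and that every pair is equally likely, as in the slides' table of ten samples; the function names `sampling_distribution` and `coverage` are illustrative and not part of the original lecture.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import sqrt

population = [3, 5, 7, 9, 11]

def sampling_distribution(pop, size):
    """Map each possible sample mean to its probability, assuming every
    sample of the given size (drawn without replacement) is equally likely."""
    samples = list(combinations(pop, size))
    counts = Counter(Fraction(sum(s), size) for s in samples)
    return {mean: Fraction(c, len(samples)) for mean, c in sorted(counts.items())}

def coverage(dist, centre, half_width):
    """Probability that a sample mean falls within half_width of centre,
    i.e. that the interval 'sample mean +/- half_width' contains centre."""
    return sum(p for mean, p in dist.items() if abs(mean - centre) <= half_width)

dist = sampling_distribution(population, 2)

# Mean of the means: E(X) = sum of X * p(X)
expected = sum(mean * p for mean, p in dist.items())

# Standard error: square root of the probability-weighted squared deviations
variance = sum(p * (mean - expected) ** 2 for mean, p in dist.items())
std_error = sqrt(variance)

print("E(X) =", expected)
print("standard error =", round(std_error, 2))
print("P(correct), point estimate:", float(coverage(dist, expected, 0)))
print("P(correct), +/- 1 s.e.:", float(coverage(dist, expected, std_error)))
print("P(correct), +/- 1.5 s.e.:", float(coverage(dist, expected, 1.5 * std_error)))
```

Exact fractions are used for the probabilities so that the sums in the tables can be reproduced without rounding error; only the final square root is a floating-point value.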
If a random sample is not very variable, it suggests that samples taken from the population will not vary much either.

Calculating the Standard Error

The theorem shows that:

σx = s/√n

where
σx is the standard error of the mean
s is the sample standard deviation
n is the sample size

A Confidence Interval from the 2005 BES

Descriptive Statistics

Variable                                       N      Minimum   Maximum   Mean     Std. Deviation
aq16a Feelings About Labour Party              3517   .00       10.00     5.0446   2.60174
aq16b Feelings About Conservative Party        3470   .00       10.00     4.4026   2.42338
aq16c Feelings About Liberal Democrat Party    3396   .00       10.00     4.7400   2.04121
Valid N (listwise)                             3390

Feelings about Labour:

µ = X̄ ± 1.96 * σx                      [probability of being correct of 0.95]
µ = 5.0446 ± 1.96 * (2.6017/√3517)
µ = 5.0446 ± 0.086
µ = 4.9586 to 5.1306                   (probability of being correct = 0.95)

Complications

The calculation assumed that the BES is a simple random sample of the UK voting population, that is, every adult in the country has an equal chance of being selected for the sample.

But if we used a simple random sample, respondents would be spread evenly across the country, involving a lot of travel time and costs for the interviewers.

Costs can be reduced by 'clustering' the sample, that is, choosing people who live relatively close together. This is done by sampling in stages: first constituencies, then wards and finally individuals.

The accuracy of the sample can be improved by stratifying it, that is, ensuring that groups appear in the sample in exactly the proportions in which they appear in the population.

In the 2005 general election 26.6 per cent of the seats had majorities of less than 10 per cent. These were the marginal seats that decided the election. In a new sample there is an advantage in making sure that exactly 26.6 per cent of the constituencies are marginal seats. A simple random sample would not necessarily deliver this: it might deliver 25 per cent by chance.
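The difference between a simple random sample and a sample stratified by marginality can be illustrated with a small simulation. This is only a sketch under stated assumptions: the sampling frame of 1,000 constituencies is hypothetical (chosen so that exactly 26.6 per cent are marginal), the sample size of 128 is borrowed from the BES design for illustration, and the function names are invented here.

```python
import random

def srs_marginal_share(frame, n, rng):
    """Percentage of marginal seats in a simple random sample of n seats."""
    sample = rng.sample(frame, n)
    return 100 * sum(sample) / n

def stratified_marginal_share(frame, n, rng):
    """Percentage of marginal seats when the sample is stratified so that
    marginals are drawn in proportion to their share of the frame."""
    marginals = [seat for seat in frame if seat]
    others = [seat for seat in frame if not seat]
    n_marginal = round(n * len(marginals) / len(frame))
    sample = rng.sample(marginals, n_marginal) + rng.sample(others, n - n_marginal)
    return 100 * sum(sample) / n

rng = random.Random(2005)  # fixed seed so the run is repeatable

# Hypothetical frame: True marks a marginal seat (266 of 1,000 = 26.6 per cent)
frame = [True] * 266 + [False] * 734

srs = [srs_marginal_share(frame, 128, rng) for _ in range(1000)]
strat = [stratified_marginal_share(frame, 128, rng) for _ in range(1000)]

print("simple random sample: min %.1f%%, max %.1f%%" % (min(srs), max(srs)))
print("stratified sample:    always %.1f%%" % strat[0])
```

With a sample of 128 the stratified share cannot be exactly 26.6 per cent, only as close as whole constituencies allow; the point of the sketch is that it is the same in every draw, whereas the simple random sample's share moves around from draw to draw.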
So we improve accuracy by replicating the known characteristics of the population. This is called stratifying by marginality.

Sampling in Practice: The BES 2005

We might want to over-sample some groups because they have interesting political characteristics and a simple random sample would provide too few cases for analysis.

This was done in Scotland in 2005. Scots make up about 9 per cent of the British population, but just over 25 per cent of the BES sample in 2005 came from Scotland, because we wanted enough cases to analyse Scottish politics, which is rather different from politics in England.

Of course any analysis of voters in Britain as a whole has to weight the sample, to make sure that the Scots are represented accurately.

The survey was designed to yield a representative sample of adults aged 18 or above living in private households in Britain (excluding the area north of the Caledonian Canal). Adults living in Northern Ireland were excluded from the study.

The sample was drawn from the Postcode Address File, a list of addresses (or postal delivery points) compiled by the Post Office. For practical reasons, samples are confined to those living in private households. People living in institutions (though not in private households at such institutions) are excluded, as are households whose addresses are not on the Postcode Address File.

The sampling method involved a clustered multi-stage design, with three separate stages of selection.

In the first instance, 128 constituencies were sampled at random: 77 in England, 29 in Scotland and 22 in Wales, using stratification on marginality of election results, geographic regions and population density. (In Wales, the percentage of Welsh speakers was used instead of geographic region.) Scottish and Welsh constituencies were over-sampled to achieve Scottish and Welsh boost samples. In England, marginal constituencies were slightly over-sampled.
Within each constituency, two wards were sampled at random, giving 256 sample points.

At each sample point (ward), addresses were selected with equal probability across the sample point. More addresses were selected in Scottish and Welsh sample points than in English ones (27 compared with 24), again in order to achieve Scottish and Welsh boost samples.

Using random methods, the interviewer then selected one person for interview at each address.

Sample Precision

The sample precision is measured by the size of the standard errors.
- If we stratify the sample, this increases precision (reduces the size of the standard error).
- If we cluster, this decreases it.
- Non-response can decrease precision if the non-respondents differ from the respondents, which they generally do. They tend to be less interested in politics and less likely to vote, so we need to weight the sample to correct for this source of bias.

Response Rates in the 2005 BES Pre-election Survey

                                                                        N       %
Addresses issued                                                        6,450
Out of scope (e.g. derelict building)                                     515
Eligible                                                                5,935   100.0
Interview achieved                                                      3,589    60.5
Interview not achieved, because:                                        2,346    39.5
  Refused                                                               1,679    28.3
  Not contacted (e.g. someone who moved without a forwarding address)     382     6.4
  Other unproductive (e.g. too ill to talk to interviewers)               285     4.8

Weighting in the BES

The Scots are over-represented in the sample, so if we want to analyse Britain as a whole they have to be reduced in numbers, or weighted. On average, if each Scot in the sample counts as only 0.3404 of a person, this corrects their over-representation. Thus 0.3404 * 933 = 318 Scots in the weighted sample, which is 8.8 per cent of the total of 3589. This is the correct proportion of Scots in Britain.

prewtbr Weight for GB [calibrated]

acountry Country   Mean     N      Std. Deviation
1 England          1.5339   2014   .86852
2 Scotland          .3404    933   .15177
3 Wales             .2836    642   .13697
Total              1.0000   3589   .89304

Unweighted Party Voting in 2010

bq12_2 Party Vote 2010 General Election

                                                       Frequency   Percent   Valid Percent   Cumulative Percent
Valid     -2.00 Refused                                   101        2.9         4.2               4.2
          -1.00 Don't Know                                 11         .3          .5               4.7
          1.00 Labour                                     731       20.8        30.6              35.2
          2.00 Conservatives                              815       23.2        34.1              69.3
          3.00 Liberal Democrats                          500       14.2        20.9              90.2
          4.00 Scottish National Party (SNP)              105        3.0         4.4              94.6
          5.00 Plaid Cymru                                 16         .5          .7              95.3
          6.00 Green Party                                 24         .7         1.0              96.3
          7.00 United Kingdom Independence Party (UKIP)    46        1.3         1.9              98.2
          8.00 British National Party (BNP)                32         .9         1.3              99.5
          9.00 Other                                       11         .3          .5             100.0
          Total                                          2392       68.1       100.0
Missing   System                                         1120       31.9
Total                                                    3512      100.0

Weighted Party Voting in 2010 (weighted for post-election analysis)

bq12_2 Party Vote 2010 General Election

                                                       Frequency   Percent   Valid Percent   Cumulative Percent
Valid     -2.00 Refused                                    86        2.8         3.7               3.7
          -1.00 Don't Know                                 11         .4          .5               4.1
          1.00 Labour                                     725       23.6        30.7              34.9
          2.00 Conservatives                              849       27.6        36.0              70.9
          3.00 Liberal Democrats                          531       17.3        22.5              93.4
          4.00 Scottish National Party (SNP)               38        1.2         1.6              95.0
          5.00 Plaid Cymru                                 10         .3          .4              95.5
          6.00 Green Party                                 23         .8         1.0              96.4
          7.00 United Kingdom Independence Party (UKIP)    45        1.5         1.9              98.3
          8.00 British National Party (BNP)                30        1.0         1.3              99.6
          9.00 Other                                        9         .3          .4             100.0
          Total                                          2357       76.7       100.0
Missing   System                                          718       23.3
Total                                                    3075      100.0

The Effects of Weighting

The actual party vote shares in 2010 were:

Labour              29.0%
Conservatives       36.1%
Liberal Democrats   23.0%
Others              11.9%

The weighted Conservative and Liberal Democrat vote shares are clearly more accurate than the unweighted ones.

Conclusions

- Statistical theory helps us to make inferences about populations from much smaller samples.
- Inferences are possible because everyone in the population has a (small) chance of ending up in the sample; therefore the sample is representative.
- In practice the calculation of sampling errors is complicated by various sample design factors aimed at making surveys less costly.
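The weighting arithmetic and the comparison of vote shares reported above can be checked with a short script. This is a sketch only: the numbers are copied from the tables in this lecture, the "valid per cent" columns are used as the survey vote shares (they include the small Refused and Don't Know categories in their base), and the names `weighted_counts` and `abs_errors` are illustrative.

```python
# Mean weight and unweighted N per country, from the 'prewtbr' table
weights = {
    "England": (1.5339, 2014),
    "Scotland": (0.3404, 933),
    "Wales": (0.2836, 642),
}

def weighted_counts(weights):
    """Effective number of respondents per country after weighting."""
    return {country: mean_weight * n for country, (mean_weight, n) in weights.items()}

counts = weighted_counts(weights)
total = sum(counts.values())
scot_share = 100 * counts["Scotland"] / total

print("weighted Scots: %.0f of %.0f (%.1f%%)" % (counts["Scotland"], total, scot_share))

# Survey vote shares (valid per cent) and the actual 2010 result
unweighted = {"Labour": 30.6, "Conservatives": 34.1, "Liberal Democrats": 20.9}
weighted = {"Labour": 30.7, "Conservatives": 36.0, "Liberal Democrats": 22.5}
actual = {"Labour": 29.0, "Conservatives": 36.1, "Liberal Democrats": 23.0}

def abs_errors(estimate, actual):
    """Absolute gap, in percentage points, between survey and actual shares."""
    return {party: round(abs(estimate[party] - actual[party]), 1) for party in actual}

errors_unweighted = abs_errors(unweighted, actual)
errors_weighted = abs_errors(weighted, actual)

for party in actual:
    print("%-18s unweighted error %.1f, weighted error %.1f"
          % (party, errors_unweighted[party], errors_weighted[party]))
```

On these figures the weighted estimates are closer for the Conservatives and the Liberal Democrats, while the Labour estimate is essentially unchanged.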