Transcript

Sampling. Fundamental principles.
Daniel Gile
www.cirinandgile.com
Gile Sampling 1

Why sample? (1)
In research, there is often (but not always) an attempt to generalize on the basis of a limited number of observations, because access is available only to part of reality. If reality were homogeneous, one observation would be enough (or two or three, to make sure that no error of observation or measurement was made).

Why sample? (2)
But reality is generally complex and variable. It is therefore necessary to find a means of making sure, to the best possible extent, that the part of reality which is observed or measured adequately represents the whole phenomenon of interest. Sampling is a set of methods which seek to ensure that whatever is observed or measured is as similar as possible, in its relevant features, to the whole phenomenon under study.

Representative samples and sampling error
A sample should be representative; that is its most important feature. In statistical terms, this does not mean that it should have exactly the same characteristics as the whole phenomenon which is the object of research (the population). Some difference between the features of the sample and the features of the population is always possible, and even very likely. Such a difference is called ‘sampling error’, though it is not an ‘error’ in the usual sense of the word. A representative sample is not one without a sampling error.
It is one without bias, that is, without a systematic deviation from the features of the population (generally either systematically more or systematically less).

Samples and populations
In statistics, the set of individuals, objects, processes, events, situations or other entities that are the object of research is called the population. The sample is a subset of this population. It has a size, namely the number of entities of which it is made up.
[Figure: Population, Sample]

What is measured in samples?
Generally, what is measured in the sample is one of the features of the entities it is made up of, in order to estimate its value in the population (percentage of unemployed, students’ marks in a test, mean time to perform a task…). Two very important measurements in a sample are its mean value (the sample mean) and its standard deviation, which indicates roughly how much individual values in the sample vary around the mean. In a representative sample, the mean is an approximation of the population mean, and the standard deviation gives an idea of the degree of uncertainty in our inferences.

Representative samples, biased samples (1)
If a sample is representative of the population, measurements on any of the units of which it is made up can vary greatly around the population mean. They may be higher or lower, sometimes considerably, but the mean value of these measurements is expected to be closer to the population mean than the extreme values which could be found. Sample means can nevertheless be sometimes higher and sometimes lower than the population mean. If a representative sample is drawn and its mean is calculated, then another representative sample is drawn and its mean is calculated, and so on, the successive means should tend to be spread equally around the population mean.
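
The behaviour of successive sample means described above can be checked with a small simulation. The following is only a minimal sketch in Python: the population (10,000 hypothetical test marks with mean 100 and standard deviation 15), the sample size of 30 and the helper name `sample_means` are all assumptions made for illustration and do not come from the slides.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples simple random samples (without replacement within
    each sample) and return the list of their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]


# Hypothetical population: 10,000 test marks (an assumption for illustration).
pop_rng = random.Random(42)
population = [pop_rng.gauss(100, 15) for _ in range(10_000)]
pop_mean = statistics.fmean(population)

means = sample_means(population, sample_size=30, n_samples=2_000)

# Share of sample means falling above the population mean.
share_above = sum(m > pop_mean for m in means) / len(means)

# Spread of individual values versus spread of sample means.
spread_individuals = statistics.pstdev(population)
spread_means = statistics.pstdev(means)

print(f"population mean:            {pop_mean:.2f}")
print(f"share of means above it:    {share_above:.2f}")
print(f"spread of individual marks: {spread_individuals:.2f}")
print(f"spread of sample means:     {spread_means:.2f}")
```

With random sampling, roughly half of the sample means should fall on each side of the population mean, and the sample means should be much less spread out than the individual values.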

Representative samples, biased samples (2)
When a large number of samples have been drawn and their sample means have been calculated, the mean of these means will be very close to the population mean… provided the samples are representative. If the samples are biased in some way, their means will tend to be systematically either above or below the population mean, and this tendency will not disappear no matter how many samples are drawn.

Sampling error and sample size
The variability of sampling error (the random difference between sample mean and population mean) can be reduced by increasing sample size. However, the reduction is proportional not to the factor by which sample size is increased, but to its square root. In other words, in order to reduce such variability by half, you need to increase sample size by a factor of 4. If you want to reduce it by 75%, you have to multiply the sample size by 16. In concrete terms, this means that sampling costs grow with the square of the desired gain in precision, so they rise steeply for relatively little gain. This is why sampling is generally not done with thousands of units.

Other ways of reducing sampling error
Another way of reducing sampling error consists in using more precise sampling methods, provided the available information on the population allows it. For instance, suppose it is known that in a given population 70% belong to ethnic group A, 20% to ethnic group B and 10% to ethnic group C (and it is believed that ethnic group is relevant). In a simple random sample, some ethnic groups could be over-represented or under-represented. In order to reduce the error, random sampling can be done separately within each group, so that 70% of the sample comes from group A, 20% from group B and 10% from group C. This method is called stratified sampling.

But if there is so much uncertainty, is it legitimate at all to make inferences?
Mathematical calculations based on probability theory make it possible to assess the probability that the mean calculated on a representative sample lies within a certain distance of the population mean (confidence interval). This is only an estimate, but it is likely to be true with a certain probability, so it is helpful in making inferences... But there is no absolute certainty, especially if there is a hidden bias somewhere.

So how do you know that a sample is representative (i.e. not biased)?
The only way to do away with any risk of bias is to conduct random sampling, that is, to draw samples in such a way that each unit in the population has the same probability of being drawn. This can be done with a random number table, or with a computer that generates a pseudo-random sequence of numbers. Any human method based on some rationale other than the generation of random phenomena carries the risk of introducing some hidden bias.

What happens in the field?
In the human and social sciences, it is only rarely possible to actually do random sampling, if only because one rarely has a full list of all persons in a population from which those to be included in the sample can be chosen at random. Also, even when such a list is available, only rarely will those drawn into the sample actually agree to participate. As a result, most of the sampling done is non-random: convenience sampling, or sampling with ‘self-selection’ by participants/respondents. One therefore cannot be certain that the sample is not biased, and strictly speaking, by strict scientific norms, inferences drawn from such samples are not up to standard.

Implications (1)
This does not invalidate the approach totally, especially if care is taken to draw samples that one believes to be representative, depending on the investigator’s knowledge of the phenomenon under study and his/her beliefs as to where biases could lie.
But there is no certainty, and there is a subjective, arbitrary part in one’s assessment of the findings and their reliability.

Implications (2)
Scientific caution therefore calls for tentative conclusions, not strong claims. Statistical analyses can be carried out, but ideally, readers of the report should be reminded that the samples are not necessarily representative. In some cases, the investigator knows that the sample comes from a well-defined subset of the population with its own features (young people, people from a certain cultural background, from a certain social class etc.). It is then desirable to point out one’s awareness of the possibility of bias arising from the specific features of this subset of the population.

Implications (3)
Summing up this particular issue: unless a sample is truly random, any generalization can only be tentative, which means that findings in a single study do not prove anything. It is only through the accumulation of findings pointing in the same direction that the idea that they can be generalized gradually gains ground.

Samples and case studies
But in that case, is there any fundamental difference between studies on samples and studies on single cases (case studies)? After all, the accumulation of convergent results in case studies has the same effect as the accumulation of convergent findings in studies on samples. Indeed, but studies on samples are more powerful because they reduce variability: a sample mean is very likely to be closer to the population mean than a random individual value. Nevertheless, case studies are legitimate, and in concrete terms of feasibility, they can sometimes be replicated more often than studies on samples… and each case study often makes it possible to conduct a more thorough, deeper investigation than studies on samples.
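
The claim that a sample mean is very likely to be closer to the population mean than a random individual value can be illustrated by simulation. This is a rough Python sketch, not part of the original slides: the skewed population (hypothetical task times averaging about 60 seconds), the sample size of 30 and the number of trials are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical, deliberately skewed population: 10,000 task times in seconds
# (an assumption for illustration; expovariate(1/60) has a mean of about 60).
population = [rng.expovariate(1 / 60) for _ in range(10_000)]
pop_mean = statistics.fmean(population)

SAMPLE_SIZE = 30   # assumed size of each "study on a sample"
TRIALS = 2_000     # number of head-to-head comparisons

mean_wins = 0
for _ in range(TRIALS):
    # A "case study": one individual value drawn at random.
    individual = rng.choice(population)
    # A "study on a sample": the mean of a simple random sample.
    sample_mean = statistics.fmean(rng.sample(population, SAMPLE_SIZE))
    if abs(sample_mean - pop_mean) < abs(individual - pop_mean):
        mean_wins += 1

share_mean_closer = mean_wins / TRIALS
print(f"sample mean closer to the population mean in "
      f"{share_mean_closer:.0%} of trials")
```

The sample mean should come out closer in a large majority of trials, though not in all of them, which is consistent with the point that a single case can occasionally land nearer the population value than a sample does.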