Sampling

[Start on front page] If you recall from the previous lecture, we left out sampling and gave it a video all its own. As with the other videos, this is a general video intended to put some pieces together; it is not as detailed as the lectures. You still need to read the lectures. Sampling is covered in lectures 3.08 to 3.11.

If you remember from the statistical analyses video, when we are analyzing large data sets we want to be able to generalize about the population (but really the sampling frame). That means we must randomly select subjects for our sample. We call this probability sampling. There are multiple ways to sample probabilistically; see the lecture for more detail.

Non-probability sampling, while not generalizable, can still be very useful for researchers. For example, if we wanted to do a study of the condition of homelessness in Orlando, we would have to find a way of contacting the homeless. Simple random sampling would not get us the respondents we need. So we might use something like snowball sampling, or we might just use a convenience sample of those outside a shelter. In this case, the study could not really go forward without using some sort of non-probability selection method.

[To 3.09] Let’s take a moment and talk more in depth about probability sampling. First, remember that we have a population we want to know about. We take a random sample from that population and then we quantitatively analyze the data. Our data analysis (such as regression) gives us estimates about our variables of interest that we want to generalize back to the population (but really the sampling frame). Remember we want randomness and representativeness. When I was learning statistics in graduate school, one common phrase was ‘representativeness through randomness’. Basically, random selection is very good at producing representative samples, so that’s nice.
However, one still must check (via frequency tables, distributions, and central tendency statistics). If the data is not representative, the result is sample bias. The example in the previous lecture was a sample of US adults that was 60% male. If the gender and politics folks are correct, then men and women are not likely to answer survey questions the same way (such as questions about preferences and behaviors). What this means is our sample is biased toward men, so our responses to preference and behavior questions will likely also not be representative of the population.

[To 3.10] Ok, moving on: So we have a sample of a population, we analyze it, we get these estimates. What does it all mean? Well, in our population there exists some parameter (an average, a proportion, etc.), but we can never really know the true value because we cannot sample all the elements in the population. (For example, we really cannot conduct an election survey of all US adults. Just think about the difficulty of conducting the census, and that is every 10 years with a very limited amount of demographic information.) So the estimate we produce with our sample (the sample estimate) is an estimate of the true population parameter. Our sample error is the difference between the true population parameter and the sample estimate (this is not the same as sample bias). It is a random error we can only reduce by increasing our sample size. Here’s why: the more people we sample for a survey, the closer we get to sampling everyone. This is the law of large numbers. (If we sampled everyone we would have the true population parameter, and thus would have no error.) At the end of lecture 3.10 you will see some review of the central tendency and distributions lecture. Make sure you understand that before moving on.

[To 3.11] OK, so how sure are we that our estimate represents the population? To answer this, we turn to the central limit theorem.
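Before turning to that theorem, the point above about sample size and sample error can be sketched with a short simulation. This is only an illustration under stated assumptions: the population of 100,000 adults, the 53% agreement rate, the sample sizes, and the helper name `average_sample_error` are all made up for the example and do not come from the lectures.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 adults, 53% of whom agree with a policy.
POPULATION = [1] * 53_000 + [0] * 47_000
TRUE_PROPORTION = sum(POPULATION) / len(POPULATION)  # the population parameter


def average_sample_error(sample_size, repetitions=200):
    """Average absolute gap between the sample estimate and the true parameter."""
    errors = []
    for _ in range(repetitions):
        # Simple random sample without replacement from the population.
        sample = random.sample(POPULATION, sample_size)
        estimate = sum(sample) / sample_size
        errors.append(abs(estimate - TRUE_PROPORTION))
    return sum(errors) / repetitions


# Larger samples should show a smaller average sample error.
for n in (50, 500, 5000):
    print(n, round(average_sample_error(n), 4))

# Sampling every element of the population leaves no sample error at all.
census_error = average_sample_error(len(POPULATION), repetitions=1)
print("census error:", census_error)
```

The average error should shrink as the sample size grows, and it drops to zero only when the "sample" is the whole population, which is the situation the census comparison describes.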
Up to this point we have been talking about one sample from the population. But what if we took 100 different samples of the same population? For each sample we have a distribution and a sample estimate (such as a mean or percentage). Now, suppose we took all the estimates (means) from the samples and created a distribution with them. We would now have a distribution of sample means. For this distribution of sample means, we have the true population mean, and we have the standard error of the means. The true population mean can be thought of as the mean of our distribution. (So, instead of a sample distribution and a mean of the sample, we now have a distribution of sample means and the true population mean.) The standard error of the means can be thought of as the standard deviation. That is, the standard error of the means describes the spread of the sample means around the center (the center is the true population mean). It describes the variation of the distribution (narrow or wide).

Well, here’s where it gets good. Remember the 68-95-99.7 Rule? The Central Limit Theorem tells us that, when the samples are reasonably large, the distribution of sample means is approximately normal. Even if our samples have skewed distributions, when we create a distribution of sample means (of those sample distributions) it is approximately normal. So, good news! We can apply the 68-95-99.7 Rule! For a distribution of sample means: about 68% of the values fall within 1 standard error of the true population mean; about 95% of values fall within 2 standard errors of the true population mean; about 99.7% of values fall within 3 standard errors of the true population mean.

Now, let’s try to put this all together: Remember from Video 2.1 that we have a measurement that has a distribution and a mean (or some parameter). So, based on the Central Limit Theorem, our sample mean (or other sample estimate) has about a 95% chance of being within 2 standard errors of the true population mean.
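The idea of a distribution of sample means can be tried out in a small simulation. The sketch below is an illustration, not something from the lectures: it assumes a skewed population (an exponential distribution with mean 1 and standard deviation 1), and it draws 2,000 samples of 100 values each rather than 100 samples so that the percentages come out stable. The name `share_within` is a hypothetical helper.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 100    # values in each sample
NUM_SAMPLES = 2000   # how many separate samples we draw
TRUE_MEAN = 1.0      # mean of an exponential distribution with rate 1
TRUE_SD = 1.0        # its standard deviation (the distribution is skewed)

# Draw many samples from the skewed population and keep each sample's mean.
sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [random.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.fmean(sample))

# Standard error of the mean: population SD divided by sqrt(sample size).
standard_error = TRUE_SD / SAMPLE_SIZE ** 0.5

center = statistics.fmean(sample_means)   # should sit near the true mean
spread = statistics.stdev(sample_means)   # should sit near the standard error


def share_within(k):
    """Share of sample means within k standard errors of the true mean."""
    hits = sum(abs(m - TRUE_MEAN) <= k * standard_error for m in sample_means)
    return hits / NUM_SAMPLES


print("center of sample means:", round(center, 3))
print("spread of sample means:", round(spread, 3))
for k in (1, 2, 3):
    print(f"within {k} standard error(s):", round(share_within(k), 3))
```

Even though each individual sample is skewed, the shares printed at the end should land close to the 68-95-99.7 pattern, and the spread of the sample means should be close to the standard error.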
The larger the standard error, the larger the range of possible values that the true population mean can take, and the lower the likelihood that our sample estimate is close to the true population mean.

Remember back in the regression tables? We are testing against the hypothesis that no relationship exists, that x has no influence on y. For each explanatory (independent) variable, we have a beta coefficient (or parameter estimate) that is an estimate of the influence of that variable on the dependent variable. Reported alongside the parameter estimate is the standard error. Now consider for a moment a situation where the parameter estimate is 5 and the standard error is 3. So, for a 95% level of confidence in the relationship, we have: 5 +/- (2*3) = 5 +/- 6. Therefore the true population parameter (the influence of x on y) is estimated to be somewhere between -1 and 11. This includes 0. Zero is a parameter value that indicates that x has no influence on y. When the parameter is 0, there is no relationship between x and y. Our confidence interval ranges from -1 to 11, so we cannot rule out that the true population parameter is 0. We cannot reject the null hypothesis of no relationship, and we cannot confirm our hypothesis (the alternative hypothesis). In sum, when you are looking at the parameter estimates and standard errors in a regression model, and twice the standard error is larger than the size of the parameter estimate itself, we cannot rule out no relationship at the 95% level (the p-value is greater than 0.05).

Ok, now let’s tackle the margin of error. Some of you who look at a lot of public opinion polls may already be quite familiar with it. Sampling error, or the margin of error, is random error that occurs when we sample from a population. Remember, the law of large numbers tells us that the larger our sample, the closer we are to sampling the entire population, and the closer our sample estimates will be to the true population parameters.
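Going back to the regression example above (a parameter estimate of 5 with a standard error of 3), the interval check can be written out as a few lines of code. This is a minimal sketch that uses the same rule of thumb as the video (plus or minus 2 standard errors for roughly 95% confidence); the function names are hypothetical and are not part of any statistics package.

```python
def rough_95_interval(estimate, standard_error):
    """Estimate plus or minus 2 standard errors (rule-of-thumb 95% interval)."""
    margin = 2 * standard_error
    return estimate - margin, estimate + margin


def interval_includes_zero(estimate, standard_error):
    """True when 0 (no relationship) falls inside the rough 95% interval."""
    low, high = rough_95_interval(estimate, standard_error)
    return low <= 0 <= high


# The example from the video: estimate of 5, standard error of 3.
low, high = rough_95_interval(5, 3)
print(low, high)                     # -1 11
print(interval_includes_zero(5, 3))  # True: cannot reject the null hypothesis

# A made-up contrast: same estimate, but a smaller standard error of 2.
print(rough_95_interval(5, 2))       # (1, 9)
print(interval_includes_zero(5, 2))  # False: the interval stays above 0
```

The second case is invented only to show the contrast: once twice the standard error is smaller than the estimate, the interval no longer includes 0.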
So, the larger the sample, the smaller the margin of error. Margin of error is extremely relevant for surveys because it tells us how close a sample estimate is likely to be to the true population parameter. For example, suppose we have a survey question that asks whether people agree or disagree with some public policy (so a dummy measurement with only agree and disagree as the choices). We find that 53% of people agree with the policy. This seems like it might be a majority, right? But we cannot say that until we look at the margin of error. We use the equation and find that because the sample size was so small (500 people), the margin of error is about 4.4 percentage points. This means that the percentage of people in the population that agree with the policy plausibly ranges between 48.6% and 57.4%. That means we cannot actually say that there is majority agreement (because the range of possible values does not remain above 50%).

With that, I am concluding the video on sampling.
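The margin of error arithmetic in the survey example can be checked with a few lines of code. The transcript does not state which equation the lecture uses, so this sketch assumes the common formula for a sample proportion at roughly 95% confidence, 1.96 * sqrt(p * (1 - p) / n); the function names and the larger sample of 2,000 are hypothetical additions for illustration.

```python
import math


def margin_of_error(proportion, sample_size, z=1.96):
    """Margin of error for a sample proportion (z=1.96 for roughly 95% confidence)."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


def plausible_range(proportion, sample_size):
    """Sample proportion minus and plus its margin of error."""
    moe = margin_of_error(proportion, sample_size)
    return proportion - moe, proportion + moe


# The survey from the video: 53% agree, sample of 500 people.
moe = margin_of_error(0.53, 500)
low, high = plausible_range(0.53, 500)
print(f"margin of error: {moe:.1%}")
print(f"range: {low:.1%} to {high:.1%}")
print("clear majority?", low > 0.50)

# Made-up contrast: the same 53% result from a sample of 2,000 people.
low_big, high_big = plausible_range(0.53, 2000)
print(f"range with n=2000: {low_big:.1%} to {high_big:.1%}")
print("clear majority?", low_big > 0.50)
```

With 500 respondents the range dips below 50%, so a majority cannot be claimed. With the hypothetical sample of 2,000 the margin of error is roughly half as large and the whole range sits above 50%.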