Transcript
Sampling
[Start on front page] If you recall from the previous lecture, we left out sampling and have given it a
video all its own. As with the other videos, this is a general video intended to put some pieces together; it is not as detailed as the lectures. You still need to read the lectures.
Sampling is covered in lectures 3.08 to 3.11. If you remember from the statistical analyses video, when
we are analyzing large data sets we want to be able to generalize about the population (but really the
sampling frame). That means we must randomly select subjects for our sample—we call this probability
sampling. There are multiple ways to sample probabilistically; see the lecture for more detail. Non-probability sampling, while not generalizable, can still be very useful for researchers. For example, if we
wanted to do a study of the condition of homelessness in Orlando, we would have to find a way of
contacting the homeless. Simple random sampling would not get us the respondents we need. So we
might use something like snowball sampling, or we might just use a convenience sample of those outside
a shelter. In this case, the study could not really go forward without using some sort of non-probability
selection method.
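As a minimal sketch of simple random sampling in Python (the sampling frame of numbered subject IDs below is hypothetical, just for illustration):
```python
import random

# Hypothetical sampling frame: probability sampling requires a frame
# of subjects we can actually draw from at random.
sampling_frame = list(range(1, 10001))  # 10,000 reachable subject IDs

random.seed(42)  # reproducible for the example

# Simple random sampling: every element of the frame has the same
# chance of selection, which is what lets us generalize back to the frame.
sample = random.sample(sampling_frame, k=500)

print(len(sample))  # 500 randomly chosen subjects
```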
[To 3.09] Let’s take a moment and talk more in depth about probability sampling. First, remember that
we have a population we want to know about. We take a random sample from that population and then
we quantitatively analyze the data. Our data analysis (such as regression) gives us estimates about our
variables of interest that we want to generalize back to the population (but really the sampling frame).
Remember we want randomness and representativeness. When I was learning statistics in graduate
school, one common phrase was ‘representativeness through randomness’. Basically, randomly selected
samples are very good at producing representative samples—so that’s nice. However, one still must
check (via frequency tables, distributions, and central tendency statistics). If the data is not representative, we have sample bias. The example in the previous lecture was a sample of US adults that was 60% male. If the gender and politics folks are correct, then men and women are not likely to answer survey questions (such as those about preferences and behaviors) the same way. What this means is our sample is biased toward men, so our responses to preference and behavior questions will likely also not be representative of the population.
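Here is a minimal sketch of such a representativeness check. The respondent data is fabricated to match the 60%-male example, and the 5-point tolerance is an arbitrary illustration, not a statistical rule:
```python
from collections import Counter

# Hypothetical respondent data reproducing the lecture's 60% male sample.
sample = ["male"] * 300 + ["female"] * 200

counts = Counter(sample)
n = len(sample)
for gender, count in counts.items():
    print(f"{gender}: {count / n:.1%}")  # male: 60.0%, female: 40.0%

# US adults are roughly half female, so a 60% male sample suggests bias.
benchmark_male = 0.50
print("possible gender bias:", abs(counts["male"] / n - benchmark_male) > 0.05)
```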
[To 3.10] Ok, moving on: So we have a sample of a population, we analyze it, we get these estimates.
What does it all mean? Well, in our population there exists some parameter (an average, a proportion, etc.)—but we can never really know the true/real value because we cannot sample all the elements in the population (for example, we really cannot conduct an election survey of all US adults—just think about the difficulty of conducting the census, and that happens only every 10 years with a very limited amount of demographic information). So the estimate we produce with our sample (the sample estimate) is an
estimate of the true population parameter. Our sample error is the difference between the true population
parameter and the sample estimate (this is not the same as sample bias)—it is a random error we can only
reduce by increasing our sample size. Here’s why: the more people we sample for a survey, the closer we get to sampling everyone—this is the law of large numbers (if we sampled everyone we would have the true population parameter, and thus would have no error). At the end of lecture 3.10 you will see
some review of the central tendency and distributions lecture. Make sure you understand that before
moving on.
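A quick simulation illustrates the law of large numbers; the population below is made up, but the shrinking error as the sample grows is the general pattern:
```python
import random
import statistics

random.seed(1)

# A hypothetical population with a known mean, so we can watch the
# sample estimate converge as n grows (the law of large numbers).
population = [random.gauss(100, 15) for _ in range(200_000)]
true_mean = statistics.mean(population)

for n in (10, 100, 1_000, 10_000, 100_000):
    estimate = statistics.mean(random.sample(population, n))
    print(f"n={n:>7,}  estimate={estimate:7.2f}  error={estimate - true_mean:+.2f}")
```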
[To 3.11] OK, so how sure are we that our estimate represents the population? To answer this, we turn to
the central limit theorem. Up to this point we have been talking about one sample from the population.
But, what if we took 100 different samples of the same population? For each sample we have a
distribution and a sample estimate (such as a mean or percentage). Now, suppose we took all the
estimates (means) from the samples and created a distribution with them. So we would now have a
distribution of sample means. For this distribution of sample means, we have the true population mean,
and we have the standard error of the means. The true population mean can be thought of as the mean of
our distribution. (So, instead of a sample distribution and a mean of the sample, we now have a
distribution of sample means and the true population mean). The standard error of the means can be
thought of as the standard deviation. That is, the standard error of the means describes the distribution of
the sample means around the center (the center is the true population mean)—it describes the variation of
the distribution (narrow or wide).
Well here’s where it gets good. Remember the 68-95-99.7 Rule? Well the Central Limit Theorem tells
us that the distribution of sample means is normal. Even if our samples have skewed distributions,
when we create a distribution of sample means (of those sample distributions) it is normal. So, good
news! We can apply the 68-95-99.7 Rule! For a distribution of sample means: 68% of the values fall
within 1 standard error of the true population mean; 95% of values fall within 2 standard errors of the true
population mean; 99.7% of values fall within 3 standard errors of the mean.
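A short simulation makes the theorem tangible; the exponential population here is hypothetical, chosen only because it is clearly skewed:
```python
import random
import statistics

random.seed(7)

# A deliberately skewed population (exponential, mean about 50).
population = [random.expovariate(1 / 50) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Take 1,000 samples of 200 and keep each sample's mean.
sample_means = [statistics.mean(random.sample(population, 200))
                for _ in range(1_000)]

# The standard error of the means is the standard deviation of this
# distribution of sample means.
se = statistics.stdev(sample_means)

# Check the 68-95-99.7 rule around the true population mean.
for k in (1, 2, 3):
    inside = sum(abs(m - true_mean) <= k * se for m in sample_means)
    print(f"within {k} SE: {inside / len(sample_means):.1%}")
```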
Now, let’s try to put this all together: Remember from Video 2.1 that we have a measurement that has a
distribution and a mean (or some parameter). So, based on the Central Limit Theorem, our sample mean
(or other sample parameter) has a 95% chance of being within 2 standard errors of the true population
mean. The larger the standard error, the larger the range of possible values that the true population mean
can take, and the lower the likelihood that our sample parameter is close to the true population mean.
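As a minimal sketch of that logic for a single sample (the data here is made up), the standard error can be estimated from the sample itself as s divided by the square root of n, and the 2-standard-error interval built from it:
```python
import random
import statistics

random.seed(3)

# One hypothetical sample; in practice this is all we ever observe.
sample = [random.gauss(100, 15) for _ in range(400)]

mean = statistics.mean(sample)
# Standard error estimated from the sample: s / sqrt(n).
se = statistics.stdev(sample) / len(sample) ** 0.5

# About 95% of sample means fall within 2 standard errors of the true
# mean, so this interval should capture the true mean about 95% of the time.
print(f"mean = {mean:.2f}, 95% interval: {mean - 2 * se:.2f} to {mean + 2 * se:.2f}")
```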
Remember back in the regression tables? We are testing against the hypothesis that no relationship exists
-- that x has no influence on y. For each explanatory (independent) variable, we have a beta coefficient
(or parameter) that is an estimate of the influence of that variable on the dependent variable. Reported
alongside the parameter is the standard error. Now consider for a moment a situation where the parameter
estimate is 5 and the standard error is 3. So, for a 95% level of confidence in the relationship, we have: 5 +/- (2*3) = 5 +/- 6. Therefore the true population parameter (the true influence of x on y) is somewhere between -1 and 11. This range includes 0. Zero is a parameter value that indicates that x has no
influence on y. When the parameter is 0, there is no relationship between x and y. Our confidence
interval ranges from -1 to 11, so we cannot confirm that the true population parameter is not 0. We
cannot reject the null hypothesis of no relationship, and we cannot confirm our hypothesis (the alternative
hypothesis). In sum, when you are looking at the parameter estimates and standard errors in a regression
model, and twice the standard error is larger than the parameter estimate itself, the confidence interval includes zero and we cannot rule out that x has no influence on y (the p-value will be greater than 0.05).
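Here is a tiny sketch of that check, using the lecture’s numbers plus one contrasting case for comparison:
```python
def check_relationship(estimate: float, std_error: float) -> None:
    """Apply the lecture's 2-standard-error rule for a rough 95% interval."""
    low, high = estimate - 2 * std_error, estimate + 2 * std_error
    print(f"estimate {estimate}, SE {std_error}: interval ({low}, {high})")
    if low <= 0 <= high:
        print("  interval includes 0: cannot reject the null hypothesis")
    else:
        print("  interval excludes 0: reject the null at the 0.05 level")

check_relationship(5, 3)  # the lecture's example: (-1, 11) includes 0
check_relationship(5, 2)  # a contrasting case: (1, 9) excludes 0
```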
Ok, now let’s tackle the margin of error. Some of you who look at a lot of public opinion polls may
already be quite familiar with it. Sampling error, or the margin of error, is random error that occurs when
we sample from a population. Remember, the law of large numbers tells us that the larger our sample, the
closer we are to sampling the entire population, and the closer our sample parameters will be to the true
population parameters. So, the larger the sample, the smaller the margin of error.
Margin of error is extremely relevant for surveys because it tells us how close a sample parameter is to
the true population parameter. For example, suppose we have a survey question that asks whether people agree or disagree with some public policy (so a dummy measurement with only agree and disagree as the choices), and we find that 53% of people agree with the policy. This seems like it might be a majority, right? But we cannot say that until we look at the margin of error. We use the equation and find that because the sample size was so small (500 people), the margin of error is about 4.4%. This means that the percentage of people in the population that agree with the policy ranges between 48.6% and 57.4%. That means we cannot actually say that there is majority agreement (because the range of possible values dips below 50%).
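The transcript does not write out the equation; assuming it is the standard margin-of-error formula for a proportion at roughly 95% confidence, this sketch reproduces the 4.4% figure:
```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Standard margin-of-error formula for a proportion at ~95% confidence."""
    return z * math.sqrt(p * (1 - p) / n)

p, n = 0.53, 500              # the lecture's poll: 53% agree, 500 respondents
moe = margin_of_error(p, n)   # about 0.044, i.e. 4.4 percentage points
print(f"MOE = {moe:.1%}; plausible range: {p - moe:.1%} to {p + moe:.1%}")
```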
With that, I am concluding the video on sampling.