Sociology 5811: Lecture 7: Samples, Populations, The Sampling Distribution
Copyright © 2005 by Evan Schofer. Do not copy or distribute without permission.

Announcements
• Problem Set #2 due today!

Review: Populations
• Population: The entire set of persons, objects, or events that have at least one common characteristic of interest to a researcher (Knoke, p. 15)
• Beyond the literal definition, a population is the general group that we wish to study and gain insight into
• Sample: A subset of a population
• Random Sample: A sample chosen from a population such that each observation has an equal chance of being selected (Knoke, p. 77)
• Randomness is one strategy to avoid biased samples

Review: Statistical Inference
• Statistical inference: making statistical generalizations about a population from evidence contained in a sample (Knoke, p. 77)
• When is statistical inference likely to work?
• 1. When a sample is large
• If a sample approaches the size of the population, it is likely to be a good reflection of that population
• 2. When a sample is representative of the entire population
• As opposed to a sample that is atypical in some way, and thus not reflective of the larger group

Populations and Samples
• Population parameters (μ, σ) are constants
• There is one true value, but it is usually unknown
• Sample statistics (Ȳ, s) are variables
• Up until now we’ve treated them as constants
• But, there are many possible samples
• The values of the mean and S.D. vary depending on which sample you have
• Like any variable, the mean and S.D.
have a distribution
• Called the “sampling distribution”
• Made up of all values for any given population

Populations and Samples: Overview

                      Population                       Sample
Characteristics:      “parameters”                     “statistics”
Characteristics are:  constant (one for population)    variables (vary for each sample)
Notation:             Greek (μ, σ)                     Roman (Ȳ, s)
Estimate:                                              “hat” (e.g., σ̂), a “point estimate” based on the sample

Population and Sample Distributions
• [Figure: population distribution with mean μ and S.D. σ, alongside a sample distribution with mean Ȳ and S.D. s]

Estimating the Mean
• Suppose we want to know the mean of a population (μ). What do we do?
• Plan A: Spend $100 million to survey our entire population
• If it is even possible to survey the whole population
• Plan B: Spend $1,000 sampling a few hundred people
• Estimate the mean
• Simply use formulas to estimate μ: the estimate is written μ̂

Estimating the Mean
• Question: Given our sample, what is our best guess of the population mean?
• Answer: The sample mean, Ȳ
• Look at Ȳ, assume that it is a “good guess”
• Thus, we calculate: μ̂ = Ȳ = (1/N) Σᵢ₌₁ᴺ Yᵢ

Estimating the Mean
• Issue: There are a very large number of possible samples that one can take from any population
• Each possible sample has a mean, most of which are different
• Some are close to the population mean, some not
• Q: How do we know if we got a “good guess”?
• A: We can’t know for sure. We may draw incorrect conclusions about the mean
• But: We can use probability theory to determine if our guess is likely to be good!

Estimates and Sampling Distributions
• It is possible to take more than one sample
• And calculate more than one estimate of the mean
• If we took many samples (and calculated many means), we’d see a range of estimates
• We could even plot a histogram of the many estimates
• Our confidence in our guess depends on how “spread out” the range of guesses tends to be
• The “standard deviation” of that particular histogram
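The “Plan B” logic above can be sketched in a few lines of Python. The population values and sizes here are made-up assumptions for illustration; in practice the full population is usually unobservable.

```python
import random

random.seed(42)  # reproducible illustration

# Hypothetical population of 100,000 values (an assumption for illustration;
# a real population would not be available to enumerate like this)
population = [random.gauss(50, 10) for _ in range(100_000)]

# Plan B: draw one random sample of a few hundred people
sample = random.sample(population, 500)

# The point estimate of mu is simply the sample mean, Y-bar
mu_hat = sum(sample) / len(sample)
```

With a few hundred cases, `mu_hat` will typically land close to the (normally unknown) population mean, which is exactly the bet that Plan B makes.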
Sampling Distributions
• Sampling Distribution: The distribution of estimates created by taking all possible unique samples (of a fixed size) from a population
• Example: Take every possible 10-person sample of sociology graduate students (all combinations)
• 1. Calculate the mean of each sample
• 2. Graph a histogram of all estimates
• This is called “the sampling distribution of the mean”
• Note: The sampling distribution is rarely known
• It is typically thought of as a probability distribution

Sampling Distribution Notation
• Population mean and S.D. are: μ, σ
• Each sample has a mean and S.D.: Ȳ, s
• The sampling distribution of the mean (i.e., the distribution of mean-estimates) also has a mean
• And a S.D., aka the “standard error”
• Mean and S.D. of the sampling distribution: μ_Ȳ, σ_Ȳ
• Question: Why are they Greek?
• A: Because all possible samples represent a population
• Question: Why is there a sub-Ȳ?
• Because it is the mean of all possible Ȳs (means)

Sampling Distribution of the Mean
• It turns out that under some circumstances, the shape of the sampling distribution of the mean can be determined
• Thus allowing one to get a sense of the range of estimates of the mean one is likely to see
• If the distribution is narrow, our guess is probably good!
• If the S.D. is large, our guess may be quite bad
• This provides insight into the probable location of the population mean
• Even if you only have one single sample to look at
• This “trick” lets us draw conclusions!

Sampling Distribution Example
• Let’s create a sampling distribution from a small population with μ = 52.
(Sample N = 3)

Case   # of CDs
1      30
2      100
3      20
4      70
5      40

• Note how the mean varies depending on the sample
• Mean of cases 1, 2, 3 = 50
• Mean of cases 2, 4, 5 = 70
• For this population (N = 5) we can calculate all possible means based on sample size 3

Sampling Distribution Example
• First, we must calculate every possible mean:
• 1,2,3 = 50
• 1,2,4 = 66.67
• 1,2,5 = 56.67
• 1,3,4 = 40
• 1,3,5 = 30
• 1,4,5 = 46.67
• 2,3,4 = 63.33
• 2,3,5 = 53.33
• 2,4,5 = 70
• 3,4,5 = 43.33

Sampling Distribution Example

Sample   Ȳ
1        50
2        66.67
3        56.67
4        40
5        30
6        46.67
7        63.33
8        53.33
9        70
10       43.33

• Here, you can see how the sample mean is really a variable
• This complete list of all possible means is the sampling distribution
• As a probability distribution, this tells us the probability of picking a sample with each mean
• Note: Sampling distribution mean = 52
• Same as the population mean!

Sampling Distribution Example
• Histogram of the sampling distribution (N = 3)
• [Figure: histogram with bins 17-27, 27-37, 37-47, 47-57, 57-67, 67-77, 77-87]
• Note: The distribution centers around the population mean, μ = 52
• And, it is roughly symmetrical

Sampling Distribution Example
• As a probability distribution, the sampling distribution gives a sense of the quality of our estimate of μ
• Probability = Frequency / N
• [Figure: the same histogram rescaled to probabilities from 0 to .4, with μ = 52 marked]
• The probability of picking a sample with a mean that is within ±5 of μ is p = .3 (30%)
• The probability of overestimating μ by more than 15 is about p = .1 (10%)
• Q: What is the probability of a “poor” estimate of μ?
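The enumeration above can be reproduced exactly: for a population of 5 cases and samples of size 3, there are just 10 combinations. A short Python sketch of the same calculation:

```python
from itertools import combinations

cds = [30, 100, 20, 70, 40]  # the 5-case CD population from the example

# Every unique 3-case sample and its mean: this IS the sampling distribution
sampling_dist = [sum(s) / len(s) for s in combinations(cds, 3)]

# The mean of the sampling distribution equals the population mean (52)
grand_mean = sum(sampling_dist) / len(sampling_dist)
```

Printing `sampling_dist` reproduces the ten means listed on the slide (50, 66.67, 56.67, 40, 30, 46.67, 63.33, 53.33, 70, 43.33), and `grand_mean` comes out to 52.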
Sampling Distribution Example
• Note: If the sampling distribution is narrow, most of our estimates of the mean will be good
• That is, they will be close to μ, the population mean
• If the sampling distribution is wide, the probability of a “bad” estimate goes up
• A measure of dispersion can help us assess the sampling distribution
• Recall: the standard deviation of a sampling distribution is called the standard error
• It tells us the width of the sampling distribution!

The Central Limit Theorem
• But, how do we know the width of the sampling distribution?
• Statisticians have shown that the sampling distribution will have consistent properties, if we have a large sample
• Several of these properties constitute the “Central Limit Theorem”
• These properties provide the basis for drawing statistical inferences about the mean

The Central Limit Theorem
• If you have a large sample (large N):
• 1. The sampling distribution of the mean (and thus all possible estimates of the mean) clusters around the true population mean
• 2. It clusters as a normal curve
• Even if the population distribution is not normal
• 3. The estimates are dispersed around the population mean by a knowable standard deviation: σ/√N

The Central Limit Theorem
• Formally stated:
• 1. As N grows large, the sampling distribution of the mean approaches normality
• 2. μ_Ȳ = μ
• 3. σ_Ȳ = σ_Y / √N

Central Limit Theorem: Visually
• [Figure: population distribution (mean μ, S.D. σ_Y) alongside the narrower sampling distribution of the mean (mean μ_Ȳ, S.D. σ_Ȳ)]

Implications of the C.L.T.
• What does this mean for us?
• Typically, we only have one sample, and thus only one estimate of μ
• The actual value of μ is unknown
• So we don’t know the center of the sampling distribution
• All we know for certain is that our estimate falls somewhere in the sampling distribution
• This is always true by definition
• And, later, we’ll estimate its width.
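The three CLT properties above can be checked by simulation. The exponential population below is a made-up, deliberately non-normal example; the standard deviation of the simulated sample means should still come out close to σ/√N.

```python
import random
import statistics

random.seed(0)  # reproducible illustration

# A skewed, non-normal population (an assumption for illustration)
population = [random.expovariate(1 / 10) for _ in range(50_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# Draw many samples of size N and record each sample mean
N = 100
means = [statistics.fmean(random.sample(population, N)) for _ in range(2_000)]

# CLT: the sample means center on mu, with spread close to sigma / sqrt(N)
se_empirical = statistics.pstdev(means)
se_theory = sigma / N ** 0.5
```

Despite the skewed population, a histogram of `means` looks roughly bell-shaped, `statistics.fmean(means)` sits near `mu`, and `se_empirical` tracks `se_theory`.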
Implications of the C.L.T.
• Visually: Suppose we observe μ̂ = 16
• [Figure: several possible sampling distributions, each containing μ̂ = 16 at a different position relative to μ]
• There are many possible locations of μ
• But, μ̂ always falls within the sampling distribution

Implications of the C.L.T.
• We know that the mean from our sample falls somewhere in this sampling distribution
• Which has mean μ and standard deviation σ/√N
• If we can estimate σ, we can estimate σ/√N...

The “Standard Error” of the Mean
• We don’t know exactly where the sample falls
• But, the laws of probability suggest that we are most likely to draw a sample with a mean from near the center
• Recall: in a normal curve, about 68% of cases fall within ±1 S.D. and about 95% within ±2 S.D.
• So, we can determine the range around μ in which 95% (or 99%, or 99.9%) of cases will fall

Implications of the C.L.T.
• What is the relation between the standard error and the size of our sample (N)?
• Answer: It is an inverse relationship
• The standard deviation of the sampling distribution shrinks as N gets larger
• Formula: σ_Ȳ = σ_Y / √N
• Conclusion: Estimates of the mean based on larger samples tend to cluster closer around the true population mean

Implications of the C.L.T.
• Visually: The width of the sampling distribution is an inverse function of N (sample size)
• The distribution of mean estimates based on N = 10 will be more dispersed; mean estimates based on N = 50 will cluster closer to μ
• [Figure: a wide sampling distribution for the smaller sample size and a narrower one for the larger sample size]

Confidence Intervals
• Benefits of knowing the width of the sampling distribution:
• 1. You can figure out the general range of error by which a given point estimate might miss
• Based on the range around the true mean in which the estimates will fall
• 2. And, this defines the range around an estimate that is likely to hold the population mean
• A “confidence interval”
• Note: These only work if N is large!
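The inverse relationship between N and the standard error can be made concrete with a quick calculation. The value σ_Y = 200 here is a hypothetical population S.D., chosen only for illustration.

```python
import math

sigma_y = 200  # hypothetical population S.D. (an assumption for illustration)

# Standard error sigma_Y / sqrt(N) for increasing sample sizes:
# quadrupling N halves the standard error (40, 20, 10)
for n in (25, 100, 400):
    se = sigma_y / math.sqrt(n)
    print(f"N = {n:4d}  standard error = {se:.0f}")
```

This is why the sampling distribution for N = 50 in the figure is narrower than the one for N = 10: the width falls off as 1/√N, not as 1/N.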
Confidence Interval
• Confidence Interval: “A range of values around a point estimate that makes it possible to state the probability that an interval contains the population parameter between its lower and upper bounds” (Bohrnstedt & Knoke, p. 90)
• It involves a range and a probability
• Examples:
• We are 95% confident that the mean number of CDs owned by grad students is between 20 and 45
• We are 50% confident that the mean rainfall this year will be between 12 and 22 inches

Confidence Interval
• Visually: It is probable that μ falls near μ̂
• [Figure: sampling distribution centered near μ̂ = 16, showing probable values of μ and the ranges where μ is unlikely to be]
• Q: Can μ be far from μ̂?
• Answer: Yes, but it is very improbable

Confidence Interval
• To figure out the range of “error” in our mean estimate, we need to know the width of the sampling distribution
• The standard error! (The S.D. of this distribution)
• The Central Limit Theorem provides a formula: σ_Ȳ = σ_Y / √N
• Problem: We do not know the exact value of σ_Y, the population standard deviation!

Confidence Interval
• Question: How do we calculate the standard error if we don’t know the population S.D.?
• Answer: We estimate it using the information we have
• Formula for the best estimate: σ̂_Ȳ = s_Y / √N
• Where N is the sample size and s_Y is the sample standard deviation

95% Confidence Interval Example
• Suppose a sample of 100 students has a mean SAT score of 1020 and a standard deviation of 200
• How do we find the 95% confidence interval?
• If N is large, we know that:
• 1. The sampling distribution is roughly normal
• 2. Therefore 95% of samples will yield a mean estimate within 2 standard deviations (of the sampling distribution) of the population mean (μ)
• Thus, 95% of the time, our estimate of μ (namely Ȳ) is within two “standard errors” of the actual value of μ.
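The claim that roughly 95% of such intervals capture μ can itself be checked by simulation. The Gaussian population below is an assumption for illustration, and Ȳ ± 2·(s/√N) is used as on the slides (a round 2 rather than the exact normal value of 1.96).

```python
import random
import statistics

random.seed(1)  # reproducible illustration

# Hypothetical population (an assumption for illustration)
population = [random.gauss(500, 100) for _ in range(100_000)]
mu = statistics.fmean(population)

N, trials, hits = 100, 1_000, 0
for _ in range(trials):
    sample = random.sample(population, N)
    y_bar = statistics.fmean(sample)
    se_hat = statistics.stdev(sample) / N ** 0.5  # estimated SE: s / sqrt(N)
    if y_bar - 2 * se_hat <= mu <= y_bar + 2 * se_hat:
        hits += 1

coverage = hits / trials  # fraction of intervals that contain mu
```

Over many repetitions, `coverage` lands close to 0.95: about 95 of every 100 intervals built this way contain the true population mean.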
95% Confidence Interval
• Formula for the 95% confidence interval: 95% CI: Ȳ ± 2(σ_Ȳ)
• Where Ȳ is the mean estimate and σ_Ȳ is the standard error
• Result: Two values, an upper and a lower bound
• Adding our estimate of the standard error: Ȳ ± 2(σ̂_Ȳ) = Ȳ ± 2(s_Y / √N)

95% Confidence Interval
• Suppose a sample of 100 students has a mean SAT score of 1020 and a standard deviation of 200
• Calculate: 95% CI: Ȳ ± 2(s_Y / √N)
• 1020 ± 2(200 / √100) = 1020 ± 2(200 / 10) = 1020 ± 2(20) = 1020 ± 40
• Thus, we are 95% confident that the population mean falls between 980 and 1060.
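The worked SAT example can be wrapped in a small helper. The function name `confidence_interval_95` is hypothetical, and the ±2 multiplier follows the large-sample approximation used on the slides.

```python
import math

def confidence_interval_95(y_bar, s, n):
    """Large-sample 95% CI: Y-bar +/- 2 * (s / sqrt(N))."""
    se_hat = s / math.sqrt(n)  # estimated standard error
    return (y_bar - 2 * se_hat, y_bar + 2 * se_hat)

# SAT example from the slides: N = 100, Y-bar = 1020, s = 200
low, high = confidence_interval_95(1020, 200, 100)
# low, high -> 980.0, 1060.0
```

The function reproduces the slide's arithmetic: an estimated standard error of 20, so bounds of 1020 − 40 = 980 and 1020 + 40 = 1060.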