Sociology 5811: Lecture 8: CLT Applications: Confidence Intervals, Examples
Copyright © 2005 by Evan Schofer
Do not copy or distribute without permission

Announcements
• Problem Set 3 handed out
• On course website

Review: Sampling Distributions
• Q: What is the sampling distribution of the mean?
• Answer: Sampling Distribution: The distribution of estimates created by taking all possible unique samples (of a fixed size) from a population
• Q: What is the Standard Error?
• Answer: The standard deviation of the sampling distribution
• Q: What does the Standard Error tell you?
• Answer: How “dispersed” estimates will be around the true parameter value

Review: Central Limit Theorem
• Q: What does the CLT mean in plain language?
1. As N grows large, the sampling distribution of the mean approaches normality
2. $\mu_{\bar{Y}} = \mu_Y$
3. $\sigma_{\bar{Y}} = \sigma_Y / \sqrt{N}$

Central Limit Theorem: Visually
[Figure: the sampling distribution of the mean]

Implications of the C.L.T
• Visually: Suppose we observe mu-hat = 16
• There are many possible locations of $\mu$
• But, mu-hat always falls within the sampling distribution
[Figure: sampling distributions centered at different possible values of $\mu$, each containing $\hat{\mu} = 16$]

Implications of the C.L.T
• What is the relation between the Standard Error and the size of our sample (N)?
• Answer: It is an inverse relationship.
• The standard deviation of the sampling distribution shrinks as N gets larger
• Formula: $\sigma_{\bar{Y}} = \sigma_Y / \sqrt{N}$
• Conclusion: Estimates of the mean based on larger samples tend to cluster closer around the true population mean.

Implications of the CLT
• The width of the sampling distribution is an inverse function of N (sample size)
– The distribution of mean estimates based on N = 10 will be more dispersed. Mean estimates based on N = 50 will cluster closer to $\mu$.
[Figure: two sampling distributions of $\hat{\mu}$ around $\mu$, a wide one for the smaller sample size and a narrow one for the larger sample size]

Confidence Intervals
• Benefits of knowing the width of the sampling distribution:
• 1. You can figure out the general range by which a given point estimate might miss
• Based on the range around the true mean within which the estimates will fall
• 2. And, this defines the range around an estimate that is likely to hold the population mean
• A “confidence interval”
• Note: These only work if N is large!

Confidence Interval
• Confidence Interval: “A range of values around a point estimate that makes it possible to state the probability that an interval contains the population parameter between its lower and upper bounds.” (Bohrnstedt & Knoke p. 90)
• It involves a range and a probability
• Examples:
• We are 95% confident that the mean number of CDs owned by grad students is between 20 and 45
• We are 50% confident the mean rainfall this year will be between 12 and 22 inches.

Confidence Interval
• Visually: It is probable that $\mu$ falls near mu-hat
[Figure: curve around $\hat{\mu} = 16$ marking the probable values of $\mu$ near the center and the range in the tails where $\mu$ is unlikely to be]
• Q: Can $\mu$ be this far from mu-hat?
• Answer: Yes, but it is very improbable

Confidence Interval
• To figure out the range of “error” in our mean estimate, we need to know the width of the sampling distribution
• The Standard Error! (S.D. of the sampling dist of the mean)
• The Central Limit Theorem provides a formula: $\sigma_{\bar{Y}} = \sigma_Y / \sqrt{N}$
• Problem: We do not know the exact value of sigma-sub-Y, the population standard deviation!

Confidence Interval
• Question: How do we calculate the standard error if we don’t know the population S.D.?
• Answer: We estimate it using the information we have: $\hat{\sigma}_{\bar{Y}} = s_Y / \sqrt{N}$
• Where N is the sample size and s-sub-Y is the sample standard deviation.

95% Confidence Interval Example
• Suppose a sample of 100 students with mean SAT score of 1020, standard deviation of 200
• How do we find the 95% Confidence Interval?
• If N is large, we know that:
• 1. The sampling distribution is roughly normal
• 2. Therefore 95% of samples will yield a mean estimate within 2 standard deviations (of the sampling distribution) of the population mean ($\mu$)
• Thus, 95% of the time, our estimates of $\mu$ (Y-bar) are within two “standard errors” of the actual value of $\mu$.

95% Confidence Interval
• Formula for 95% confidence interval: $95\%\ \text{CI}: \bar{Y} \pm 2(\sigma_{\bar{Y}})$
• Where Y-bar is the mean estimate and sigma (Y-bar) is the standard error
• Result: Two values, an upper and a lower bound
• Adding our estimate of the standard error: $\bar{Y} \pm 2(\hat{\sigma}_{\bar{Y}}) = \bar{Y} \pm 2(s_Y / \sqrt{N})$

95% Confidence Interval
• Suppose a sample of 100 students with mean SAT score of 1020, standard deviation of 200
• Calculate: $95\%\ \text{CI}: \bar{Y} \pm 2(s / \sqrt{N})$
$1020 \pm 2(200 / \sqrt{100}) = 1020 \pm 2(200 / 10) = 1020 \pm 2(20) = 1020 \pm 40$
• Thus, we are 95% confident that the population mean falls between 980 and 1060

Confidence Intervals
• Question: Suppose we want to know the confidence interval for a value other than 95%?
• How can we find the C.I. for any number?
• Answer #1: We know that 68% of cases fall within 1 standard deviation, more than 99% within 3
• Q: What is 99% C.I.? (Y-bar = 1020, S.D. = 200)
$99\%\ \text{CI}: \bar{Y} \pm 3(s / \sqrt{N}) = 1020 \pm 3(200 / \sqrt{100})$, i.e., 960 to 1080

Confidence Intervals
• Question: Which was a larger range: the 95% CI or 99% CI?
• Answer: The 99% range was larger
• The larger the range, the more likely that the true mean will fall in it
• It is a safe bet if you specify a very wide range
• If you want to bet that the mean will fall in a very narrow range, you’ll lose more often.

Confidence Intervals
• Question: Suppose we want to know the confidence interval for a value other than 95%?
• Answer #2: Look at the “Z-table”
• Z-table = Normal curve probability distribution with mean 0, SD of 1
• Found on Knoke, p. 459
– It tells you the % of cases falling within a particular number of S.D.’s of the mean
• Lists all values, not just 1, 2, and 3!

Confidence Intervals: Z-table
• Question: What Z-value should we use for a 20% confidence interval?
• Answer: 10% fall from 0 to Z=.26.
20% of cases fall from -.26 to +.26

Confidence Intervals
• General formula for Confidence Interval: $\text{C.I.}: \bar{Y} \pm Z_{\alpha/2}(\sigma_{\bar{Y}})$
• Where:
– Y-bar is the sample mean
– Sigma sub-Y-bar is the standard error of the mean
– Z sub α/2 is the Z-value for the level of confidence
– It can be looked up in a Z-table
– If you want 90%, look up p(0 to Z) of .45

Small N Confidence Intervals
• If N is large, the C.L.T. assures us that the sampling distribution is normal
• This allows us to construct confidence intervals
• Issue: What if N is not large?
• The sampling distribution may not be normal
• Z-distribution probabilities don’t apply…
• In short: If N is small, our confidence interval formula based on Z-scores doesn’t work.

Small N Confidence Intervals
• Solution: Find another curve that accurately characterizes the sampling distribution for small N
• The “T-distribution”
• An alternative that accurately approximates the shape of the sampling distribution for small N
• The T distribution is actually a set of distributions with known probabilities
• Again, we can look up values in a table to determine probabilities associated with a # of standard deviations from the mean.

Confidence Intervals for Small N
• Small N C.I. Formula: $\text{C.I.}: \bar{Y} \pm t_{\alpha/2}(\hat{\sigma}_{\bar{Y}})$
• Yields accurate results, even if N is not large
• Again, the standard error can be estimated by the sample standard deviation: $\text{C.I.}: \bar{Y} \pm t_{\alpha/2}(s / \sqrt{N})$

T-Distributions
• Issue: Which T-distribution do you use?
• The T-distribution is a “family” of distributions
• In a T-Distribution table, you’ll find many T-distributions to choose from
• One t-distribution for each “degree of freedom”
– Also called “df” or “DofF”
• Which T-distribution should you use?
• For confidence intervals: Use T-distribution for df = N - 1
• Ex: If N = 15, then look at T-distribution for df = 14.

Looking Up T-Tables
• Choose the desired probability for α/2
• Find t-value in correct row and column
• Choose the correct df (N-1)
• Interpretation is just like a Z-score.
• 2.145 = number of standard errors for C.I.!

Uses of Confidence Intervals
• What are some uses for confidence intervals?
• 1. Assessing the general quality of an estimate
– Ex: Mean level of happiness of graduate students
• Happiness scored on a measure from 1-10 (10=most)
– Suppose the 95% CI is: 6 +/- 4
• i.e., range = 2 to 10
– Question: Is this a “good” estimate?
– Answer: No, it is not very useful.
• Something like 6 +/- 1 is a more useful estimate.

Uses of Confidence Intervals
• 2. Comparing a mean estimate to a specific value
• Ex: Comparing a school’s test scores to a national standard
• Suppose national standard on a math test is 47
• Suppose a sample of students scores 52. Did the school population meet the national standard?
• If 99% CI is 50-54, then the answer is probably yes
– If 99% CI is 42-62, it isn’t certain.
• Ex: A factory makes bolts that must hold 10 kilos
• Confidence intervals let you verify that the bolts are strong enough, without testing each one.

Uses of the Sampling Distribution
• Extended example:
• Let’s figure out what the sampling distribution looks like for a specific population
• Since the sampling distribution is a probability distribution, we can then calculate the probability of observing any particular value of Y-bar (given a known $\mu$)
• Note: Later we’ll use the converse logic to draw conclusions about the actual value of $\mu$, given an observed Y-bar.

Probability of Y-bar, given $\mu$
• Suppose we have a population with the following characteristics:
• $\mu = 23$, $\sigma = 9$
• What is the probability of picking a sample (N=35) that has a mean of 27 or more?
• To determine this, we must first determine the shape of the sampling distribution
• Then we can determine the probability of falling a given distance from its mean…

Probability of Y-bar, given $\mu$
• Q: According to the Central Limit Theorem, what is the mean of the sampling distribution?
• A: Same as the population: $\mu_{\bar{Y}} = \mu = 23$
• Second, we must determine the “width” of the sampling distribution: the standard deviation (referred to as Standard Error)
• The C.L.T says we can calculate it as: $\sigma_{\bar{Y}} = \sigma_Y / \sqrt{N} = 9 / \sqrt{35} = 9 / 5.9 = 1.52$

Probability of Y-bar, given $\mu$
• If we know $\mu$ and the Standard Error, we can draw the sampling distribution of the mean for this population:
[Figure: normal curve with $\mu_{\bar{Y}} = 23$, $\sigma_{\bar{Y}} = 1.5$; horizontal axis from 19 to 27]

Probability of Y-bar, given $\mu$
• We know that 95% of possible Y-bars fall within two Standard Errors (i.e., +/- 3):
– between 20 and 26
[Figure: the same curve, $\mu_{\bar{Y}} = 23$, $\sigma_{\bar{Y}} = 1.5$; horizontal axis from 19 to 27]

Probability of Y-bar, given $\mu$
• To determine the probability associated with a particular value, convert to Z-scores
• p(-1<Z<1) is .68, p(-2<Z<2) is .95, etc
• We use a slightly different Z-score formula than we learned before
• But it is analogous: $Z_i = (Y_i - \bar{Y}) / s_Y$ becomes $Z = (\bar{Y} - \mu) / \sigma_{\bar{Y}}$

Probability of Y-bar, given $\mu$
• Why use a different formula for Z-scores?
• Old formula calculates # standard deviations a case falls from the sample mean
• From Y-sub-i to Y-bar: $Z_i = (Y_i - \bar{Y}) / s_Y$
• New formula tells the number of standard errors a mean estimate falls from the population mean
• From Y-bar to mu: $Z = (\bar{Y} - \mu) / \sigma_{\bar{Y}}$

Probability of Y-bar, given $\mu$
• Back to the problem: What is the Z-score associated with getting a sample mean of 27 or greater from this population?
• Sampling distribution mean = 23
• Standard error = 1.5
$Z = (\bar{Y} - \mu) / \sigma_{\bar{Y}} = (27 - 23) / 1.5 \approx 2.66$

Probability of Y-bar, given $\mu$
• Finally, what is the probability of observing a Z-score of 2.66 (or greater) in a standard normal distribution?
• To convert Z-scores to probabilities, look them up in a table, such as Knoke p. 463
• Area beyond Z=2.66 is .0039
• How do we interpret that?
• Let’s look at it visually:

Probability of Y-bar, given $\mu$
• The Z-distribution is a probability distribution
– Total area under curve = 1.0
– Area under half curve is .5
– Red area (“Area beyond Z”) = .0039

Probability of Y-bar, given $\mu$
• Is the probability of Z > 2.66 very large? No!
• Red area = probability of Z > 2.66 = .004, which is .4%
[Figure: standard normal curve, horizontal axis from -3 to 3, with the area beyond Z = 2.66 shaded red]

Probability of Y-bar, given $\mu$
• Conclusion: Y-bar of 27 (or larger) should occur only 4 out of 1000 times we sample from this population
• Possible interpretations:
• 1. We just experienced an improbable sample
• 2. Our sample was biased, not representative
• 3. Maybe we begin to suspect that the population mean ($\mu$) isn’t really 23 after all…
• Idea: We could “cast doubt on” someone’s claim that $\mu = 23$, given this observed Y-bar and S.D.
• Hypothesis testing is based on this!

Conclusions About Means
• The previous example started out with the assumption that $\mu = 23$
– Typically, $\mu$ will be unknown; only Y-bar is known
– But, the same logic can be applied to “test” whether $\mu$ is likely to equal 23
• If observed Y-bar is highly unlikely, we cast doubt on the idea that $\mu$ is really 23
– Example: We can “test” whether a school’s math scores are above the national standard of 47
• If the school sample is far above the national average, it is improbable that the school population is at or below 47
• Next Class: Hypothesis testing!
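
Worked examples in code
The two calculations in this lecture (the large-N confidence interval for the SAT sample, and the probability of drawing a sample mean of 27 or more when μ = 23, σ = 9, N = 35) can be checked numerically. The following is a minimal sketch, assuming Python 3.8 or later and only the standard library. The function names are illustrative and do not come from the lecture. The sketch uses exact Z-values and an unrounded standard error, so its results differ slightly from the slides, which round Z to 2 for 95% and the standard error to 1.5.

```python
from math import sqrt
from statistics import NormalDist

# Standard normal curve (mean 0, SD 1): plays the role of the Z-table.
STD_NORMAL = NormalDist()


def standard_error(sd, n):
    """Standard error of the mean: SD / sqrt(N)."""
    return sd / sqrt(n)


def z_confidence_interval(mean, sd, n, confidence=0.95):
    """Large-N interval: mean +/- Z_(alpha/2) * SD / sqrt(N).

    Uses the sample SD as the estimate of the population SD,
    as the lecture does. Only appropriate when N is large.
    """
    alpha = 1.0 - confidence
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # about 1.96 for 95%
    margin = z * standard_error(sd, n)
    return mean - margin, mean + margin


def prob_sample_mean_at_least(value, mu, sigma, n):
    """P(Y-bar >= value) for a population with mean mu and SD sigma."""
    z = (value - mu) / standard_error(sigma, n)
    return 1.0 - STD_NORMAL.cdf(z)  # area beyond Z in the upper tail


if __name__ == "__main__":
    # SAT example: N = 100, Y-bar = 1020, s = 200.
    low, high = z_confidence_interval(1020, 200, 100, confidence=0.95)
    print(f"95% CI for mean SAT score: {low:.1f} to {high:.1f}")

    # Extended example: mu = 23, sigma = 9, N = 35, Y-bar >= 27.
    p = prob_sample_mean_at_least(27, 23, 9, 35)
    print(f"P(Y-bar >= 27) = {p:.4f}")
```

With Z of about 1.96 instead of 2, the SAT interval comes out a little narrower than 980 to 1060, and the tail probability stays close to the .004 found from the table. The small-N case is not covered here: it needs t-distribution values, which the Python standard library does not provide.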