STAT 101
Dr. Kari Lock Morgan
Normal Distribution
Chapter 5
• Normal distribution
• Central limit theorem
• Normal distribution for confidence intervals
• Normal distribution for p-values
• Standard normal
Statistics: Unlocking the Power of Data Lock5

Re-grade Requests
• 4e potential grading mistake: 0.025 is correct
• Requests for a re-grade must be submitted in writing by class on Wednesday, March 5th
• Partial credit will NOT be adjusted
• Valid re-grade requests:
  - You got points off but believe your answer is correct
  - Points were added incorrectly
• Warning: scores may go up or down

Bootstrap and Randomization Distributions
[Figure: six dot plots of bootstrap and randomization distributions. Panels: Correlation: Malevolent uniforms; Slope: Restaurant tips; Mean: Body temperatures; Diff means: Finger taps; Proportion: Owners/dogs; Mean: Atlanta commutes]
What do you notice?

Normal Distribution
• The symmetric, bell-shaped curve we have seen for almost all of our bootstrap and randomization distributions is called a normal distribution
[Figure: bell-shaped histogram (vertical axis: Frequency; horizontal axis from -3 to 3)]

Central Limit Theorem!
For a sufficiently large sample size, the distribution of sample statistics for a mean or a proportion is approximately normal
www.lock5stat.com/StatKey

Distribution of $\hat{p}$
[Figure: distributions of $\hat{p}$ for n = 1, 10, 30, 50, 100 and for p = 0.5, 0.7, 0.1; horizontal axes from 0.0 to 1.0]

CLT for a Mean
[Figure: a population distribution, followed by the distribution of sample data and the distribution of sample means for n = 10, n = 30, and n = 50]

Central Limit Theorem
• The central limit theorem holds for ANY original distribution, although “sufficiently large sample size” varies
• The more skewed the original distribution is (the farther from normal), the larger the sample size has to be for the CLT to work
• For small samples, it is more important that the data itself is approximately normal

Central Limit Theorem
• For distributions of a quantitative variable that are not very skewed and without large outliers, n ≥ 30 is usually sufficient to use the CLT
• For distributions of a categorical variable, counts of at least 10 within each category are usually sufficient to use the CLT

Accuracy
• The accuracy of intervals and p-values generated using simulation methods (bootstrapping and randomization) depends on the number of simulations (more simulations = more accurate)
• The accuracy of intervals and p-values generated using formulas and the normal distribution depends on the sample size (larger sample size = more accurate)
• If the distribution of the statistic is truly normal and you have generated many simulated randomizations, the p-values should be very close

Normal Distribution
• The normal distribution is fully characterized by its mean and standard deviation: $N(\text{mean}, \text{standard deviation})$

Bootstrap Distributions
If a bootstrap distribution is approximately normally distributed, we can write it as
a) N(parameter, sd)
b) N(statistic, sd)
c) N(parameter, se)
d) N(statistic, se)
sd = standard deviation of variable
se = standard error = standard deviation of statistic

Hearing Loss
• In a random sample of 1771 Americans aged 12 to 19, 19.5% had some hearing loss (this is a dramatic increase from a decade ago!)
• What proportion of Americans aged 12 to 19 have some hearing loss? Give a 95% CI.
Rabin, R. “Childhood: Hearing Loss Grows Among Teenagers,” www.nytimes.com, 8/23/10.

Hearing Loss
(0.177, 0.214)

Hearing Loss
N(0.195, 0.0095)

Confidence Intervals
If the bootstrap distribution is normal:
To find a P% confidence interval, we just need to find the middle P% of the distribution N(statistic, SE)

Area under a Curve
• The area under the curve of a normal distribution over a range is equal to the proportion of the distribution falling within that range
• Knowing just the mean and standard deviation of a normal distribution allows you to calculate areas in the tails and percentiles
www.lock5stat.com/statkey

Hearing Loss
www.lock5stat.com/statkey
(0.176, 0.214)

Standardized Data
Often, we standardize the data to have mean 0 and standard deviation 1
This is done with z-scores
From x to z: $z = \dfrac{x - \text{mean}}{\text{sd}}$
From z to x: $x = \text{mean} + z \times \text{sd}$
Places everything on a common scale

Standard Normal
• The standard normal distribution is the normal distribution with mean 0 and standard deviation 1: $N(0, 1)$
[Figure: standard normal curve; horizontal axis from -3 to 3]

Standardized Data
Confidence Interval (bootstrap distribution): mean = sample statistic, sd = SE
From z to x: $x = \text{mean} + z \times \text{sd}$
(CI) $x = \text{statistic} + z \times SE$

P% Confidence Interval
1. Find z-scores (-z* and z*) that capture the middle P% of the standard normal
2. Return to original scale with $\text{statistic} \pm z^* \times SE$
[Figure: standard normal curve with the middle P% shaded between -z* and z*]

Confidence Interval using N(0,1)
If a statistic is normally distributed, we find a confidence interval for the parameter using
$\text{statistic} \pm z^* \times SE$
where the area between -z* and +z* in the standard normal distribution is the desired level of confidence.

Confidence Intervals
Find z* for a 99% confidence interval.
www.lock5stat.com/statkey
z* = 2.576

z*
Why use the standard normal?
Common confidence levels:
• 95%: z* = 1.96 (but 2 is close enough)
• 90%: z* = 1.645
• 99%: z* = 2.576

Sin Taxes
In March 2011, a random sample of 1000 US adults were asked “Do you favor or oppose ‘sin taxes’ on soda and junk food?”
320 adults responded in favor of sin taxes.
Give a 99% CI for the proportion of all US adults that favor these sin taxes.
From a bootstrap distribution, we find SE = 0.015

Randomization Distributions
If a randomization distribution is approximately normally distributed, we can write it as
a) N(null value, se)
b) N(statistic, se)
c) N(parameter, se)

p-values
If the randomization distribution is normal:
To calculate a p-value, we just need to find the area in the appropriate tail(s) beyond the observed statistic of the distribution

First Born Children
• Are first born children actually smarter?
• Explanatory variable: first born or not
• Response variable: combined SAT score
• Based on a sample of college students, we find $\bar{x}_{\text{first born}} - \bar{x}_{\text{not first born}} = 30.26$
• From a randomization distribution, we find SE = 37

First Born Children
$\bar{x}_{\text{first born}} - \bar{x}_{\text{not first born}} = 30.26$, SE = 37
What normal distribution should we use to find the p-value?
a) N(30.26, 37)
b) N(37, 30.26)
c) N(0, 37)
d) N(0, 30.26)

Hypothesis Testing
[Figure: distribution of the statistic assuming the null, with the observed statistic marked and the p-value shaded as the tail area beyond it; horizontal axis (Statistic) from -3 to 3]

First Born Children
N(0, 37)
www.lock5stat.com/statkey
p-value = 0.207

Standardized Data
Hypothesis test (randomization distribution): mean = null value, sd = SE
From x to z (test): $z = \dfrac{x - \text{mean}}{\text{sd}} = \dfrac{x - \text{null}}{SE}$

p-value using N(0,1)
If a statistic is normally distributed under H0, the p-value is the probability a standard normal is beyond
$z = \dfrac{\text{sample statistic} - \text{null parameter}}{SE}$

First Born Children
$\bar{x}_{\text{first born}} - \bar{x}_{\text{not first born}} = 30.26$, SE = 37
1) Find the standardized test statistic
2) Compute the p-value

z-statistic
If z = -3, using α = 0.05 we would
(a) Reject the null
(b) Not reject the null
(c) Impossible to tell
(d) I have no idea

z-statistic
• Calculating the number of standard errors a statistic is from the null value allows us to assess extremity on a common scale

Confidence Interval Formula
IF SAMPLE SIZES ARE LARGE…
$\text{sample statistic} \pm z^* \times SE$
• sample statistic: from original data
• z*: from N(0,1)
• SE: from bootstrap distribution

Formula for p-values
IF SAMPLE SIZES ARE LARGE…
$z = \dfrac{\text{sample statistic} - \text{null value}}{SE}$
• sample statistic: from original data
• null value: from H0
• SE: from randomization distribution
Compare z to N(0,1) for p-value

Standard Error
• Wouldn’t it be nice if we could compute the standard error without doing thousands of simulations?
• We can!!!
• Or at least we’ll be able to next class…

t-distribution
• For quantitative data, we use a t-distribution instead of the normal distribution
• The t-distribution is very similar to the standard normal, but with slightly fatter tails (to reflect the uncertainty in the sample standard deviations)

Degrees of Freedom
• The t-distribution is characterized by its degrees of freedom (df)
• Degrees of freedom are based on sample size
  - Single mean: df = n - 1
  - Difference in means: df = min(n1, n2) - 1
  - Correlation: df = n - 2
• The higher the degrees of freedom, the closer the t-distribution is to the standard normal

Aside: William Sealy Gosset

The Pygmalion Effect
Teachers were told that certain children (chosen randomly) were expected to be intellectual “growth spurters,” based on the Harvard Test of Inflected Acquisition (a test that didn’t actually exist). The response variable is change in IQ over the course of one year.
Source: Rosenthal, R. and Jacobson, L. (1968). “Pygmalion in the Classroom: Teacher Expectation and Pupils’ Intellectual Development.” Holt, Rinehart and Winston, Inc.

The Pygmalion Effect
         Control Students    “Growth Spurters”
n        255                 65
x̄        8.42                12.22
s        12.0                13.3
Can this provide evidence that merely expecting a child to do well actually causes the child to do better? If so, how much better?
SE = 1.8
*$s_1$ and $s_2$ were not given, so I set them to give the correct p-value

Pygmalion Effect
From the paper: “The difference in gains could be ascribed to chance about 2 in 100 times”

To Do
• Do Project 1 (due 3/7)
• Read Chapter 5
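
Worked sketch of the large-sample formulas

As a recap, the two formulas from this lecture (statistic ± z* × SE for a confidence interval, and z = (statistic - null value) / SE compared to N(0,1) for a p-value) can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function names are hypothetical, the standard library's `statistics.NormalDist` stands in for StatKey, and the choice of an upper-tail test for the First Born and Pygmalion examples is an assumption. The Pygmalion data are quantitative, so the lecture's t-distribution would be the intended tool there; the normal is used only as an approximation.

```python
from statistics import NormalDist

# Standard normal N(0, 1), used for both z* values and tail areas.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def normal_ci(statistic, se, confidence=0.95):
    """Confidence interval: statistic +/- z* x SE (hypothetical helper)."""
    # z* leaves (1 - confidence) / 2 in each tail of N(0, 1).
    z_star = STANDARD_NORMAL.inv_cdf(0.5 + confidence / 2)
    margin = z_star * se
    return statistic - margin, statistic + margin


def z_statistic(statistic, null_value, se):
    """Number of standard errors the statistic is from the null value."""
    return (statistic - null_value) / se


def normal_p_value(z, tail="upper"):
    """Area of N(0, 1) beyond z in the requested tail(s)."""
    if tail == "upper":
        return 1.0 - STANDARD_NORMAL.cdf(z)
    if tail == "lower":
        return STANDARD_NORMAL.cdf(z)
    if tail == "two-sided":
        return 2.0 * (1.0 - STANDARD_NORMAL.cdf(abs(z)))
    raise ValueError("tail must be 'upper', 'lower', or 'two-sided'")


if __name__ == "__main__":
    # Hearing Loss: statistic = 0.195, SE = 0.0095, 95% CI.
    print("Hearing loss 95% CI:", normal_ci(0.195, 0.0095, 0.95))

    # Sin Taxes: 320 of 1000 in favor, SE = 0.015, 99% CI.
    print("Sin taxes 99% CI:", normal_ci(320 / 1000, 0.015, 0.99))

    # First Born Children: difference = 30.26, null = 0, SE = 37.
    z_first = z_statistic(30.26, 0.0, 37.0)
    print("First born z:", z_first, "p-value:", normal_p_value(z_first))

    # Pygmalion Effect: difference in mean IQ gain, null = 0, SE = 1.8.
    z_pyg = z_statistic(12.22 - 8.42, 0.0, 1.8)
    print("Pygmalion z:", z_pyg, "p-value:", normal_p_value(z_pyg))
```

Under these assumptions, the hearing loss interval and the first born p-value should land close to the (0.176, 0.214) and 0.207 shown on the slides, and the Pygmalion p-value should be in the neighborhood of the "about 2 in 100" quoted from the paper.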