Stat 112 Notes 4

• Today:
  – Review of p-values for one-sided tests
  – Chapter 3.4.2 (Assessing the fit of the regression line)
  – Chapter 3.5.2 (Prediction Intervals)
  – Chapter 3.7 (Some Cautions in Interpreting Regression Results)
• Homework 1 due on Thursday
• For Thursday's office hours, for this week only, I will hold them from 1-2 instead of after class (I have my usual office hours today after class).

p-values for One-Sided Tests Example: Poverty Rates and Doctors

[Figure: Bivariate Fit of MDs per 100,000 By Poverty Percent -- scatterplot with fitted line]

Parameter Estimates
  Term             Estimate    Std Error   t Ratio   Prob>|t|
  Intercept        286.84208   33.14046     8.66     <.0001
  Poverty Percent   -4.329299   2.669525   -1.62     0.1114

Example: One-Sided Test

Do there tend to be fewer doctors in states with higher poverty rates?
Let Y = MDs per 100,000 and X = Poverty Percent.
Simple linear regression model: E(Y|X) = β0 + β1·X
  H0: β1 = 0
  Ha: β1 < 0
Because the t-ratio is negative and is on the same side as the alternative, the p-value is (Prob>|t|)/2 = 0.1114/2 = 0.0557. Suggestive but inconclusive evidence that there tend to be fewer doctors in states with higher poverty rates.

Example Continued: One- and Two-Sided Tests

Do there tend to be more doctors in states with higher poverty rates?
  H0: β1 = 0
  Ha: β1 > 0
Because the t-ratio is negative and on the opposite side of the alternative, the p-value is 1 − (Prob>|t|)/2 = 1 − 0.1114/2 = 0.9443.

Is poverty rate associated with the number of doctors in a state? This is a two-sided question, so the p-value is Prob>|t| = 0.1114. There is not strong evidence that poverty rate is associated with the number of doctors in a state.

Teachers' Salaries and Dating

• In U.S. culture, it is usually considered impolite to ask how much money a person makes.
• However, suppose that you are single and are interested in dating a particular person.
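The one-sided p-value bookkeeping can be sketched in a few lines of Python, using only the two-sided p-value (Prob>|t| = 0.1114) and the sign of the t-ratio from the regression output:

```python
# Two-sided p-value reported by the software ("Prob>|t|") for the slope,
# and the slope's t-ratio, both from the poverty/doctors output
p_two_sided = 0.1114
t_ratio = -1.62

# Ha: beta1 < 0 -- the t-ratio is negative, the same side as this alternative,
# so the one-sided p-value is half the two-sided p-value
p_less = p_two_sided / 2 if t_ratio < 0 else 1 - p_two_sided / 2

# Ha: beta1 > 0 -- the t-ratio is on the opposite side of this alternative,
# so the one-sided p-value is one minus half the two-sided p-value
p_greater = 1 - p_two_sided / 2 if t_ratio < 0 else p_two_sided / 2

print(round(p_less, 4), round(p_greater, 4))  # 0.0557 0.9443
```

The same two branches apply to any slope: halve Prob>|t| when the estimate falls on the alternative's side, and use one minus that half otherwise.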
• Of course, salary isn't the most important factor when considering whom to date, but it certainly is nice to know (especially if it is high!).
• In this case, the person you are interested in happens to be a high school teacher, so you know a high salary isn't an issue.
• Still, you would like to know how much she or he makes, so you take an informal survey of 11 high school teachers that you know.

[Figure: Distribution (histogram) of Salary for the 11 surveyed teachers, roughly $35,000-$60,000]

Moments
  Mean             50881.818
  Std Dev           6491.1968
  Std Err Mean      1957.1695
  upper 95% Mean   55242.664
  lower 95% Mean   46520.973
  N                11

Based on this data, what can you conclude? Absent any other information, the best guess for a teacher's salary is the mean salary, $50,882. But it is likely that this estimate will not be exactly correct. To get an idea of how far off you might be, you can calculate the standard deviation:

  s = sqrt( Σ_{i=1}^{11} (y_i − ȳ)² / (n − 1) ) = sqrt( 421437378 / 10 ) = 6491.82

The standard deviation is the "typical" amount by which an observation deviates from the mean. Thus, your best estimate for your potential date's salary is $50,882, but a typical estimate will be off by about $6,500.

• You happen to know that the person you are interested in has been teaching for 8 years.
• How can you use this information to better predict your potential date's salary?
• Regression analysis to the rescue!
• You go back to each of the original 11 teachers you surveyed and ask them for their years of experience.
• Simple linear regression model: E(Y|X) = β0 + β1·X; the distribution of Y given X is normal with mean β0 + β1·X and standard deviation σe.
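As a quick check, the standard deviation can be recomputed directly from the sum of squared deviations (421437378) and sample size given in the notes:

```python
import math

sum_sq_dev = 421437378  # sum of squared deviations from the mean, from the notes
n = 11                  # number of teachers surveyed

# Sample standard deviation: divide by n - 1, then take the square root
s = math.sqrt(sum_sq_dev / (n - 1))
print(round(s, 2))  # 6491.82
```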
[Figure: Bivariate Fit of Salary By Years of Experience -- scatterplot of the 11 teachers with fitted line; Salary from $35,000 to $65,000, Years of Experience from 0 to 12.5]

Linear Fit
  Salary = 40612.135 + 1686.0674 * Years of Experience

Summary of Fit
  RSquare                      0.545881
  RSquare Adj                  0.495423
  Root Mean Square Error       4610.93
  Mean of Response             50881.82
  Observations (or Sum Wgts)   11

• Predicted salary of your potential date, who has been a teacher for 8 years, = estimated mean salary for teachers with 8 years of experience = 40612.135 + 1686.0674*8 = $54,100.
• How far off will your estimate typically be? Root mean square error = estimated standard deviation of Y|X = $4,610.93.
• Notice that the typical error of your estimate of teacher salary using experience, $4,610.93, is less than that of using only information on mean teacher salary, $6,491.20.
• Regression analysis enables you to better predict your potential date's salary.

R Squared

• How much better predictions of your potential date's salary does the simple linear regression model provide than just using the mean teacher's salary?
• This is the question that R squared addresses.
• R squared: a number between 0 and 1 that measures how much of the variability in the response the regression model explains.
• R squared close to 0 means that using the regression to predict Y|X isn't much better than using the mean of Y; R squared close to 1 means that the regression is much better than the mean of Y for predicting Y|X.
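A minimal sketch of using the fitted line for prediction; the intercept and slope are the least squares estimates from the Linear Fit output:

```python
# Fitted line from the notes: Salary = 40612.135 + 1686.0674 * Years of Experience
b0, b1 = 40612.135, 1686.0674

def predicted_salary(years):
    """Estimated mean salary for teachers with the given years of experience."""
    return b0 + b1 * years

pred_8 = predicted_salary(8)
print(round(pred_8, 2))  # 54100.67, i.e. about $54,100
```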
R Squared Formula

  R² = (Total sum of squares − Residual sum of squares) / Total sum of squares

• Total sum of squares = Σ_{i=1}^{n} (Y_i − Ȳ)² = the sum of squared prediction errors from using the sample mean of Y to predict Y.
• Residual sum of squares = Σ_{i=1}^{n} (Y_i − Ŷ_i)², where Ŷ_i = β̂0 + β̂1·X_i is the prediction of Y_i from the least squares line.

What's a Good R Squared?

• A good R² depends on the context. In precise laboratory work, R² values under 90% might be too low, but in social science contexts, where a single variable rarely explains a great deal of the variation in the response, R² values of 50% may be considered remarkably good.
• The best measure of whether the regression model is providing predictions of Y|X that are accurate enough to be useful is the root mean square error, which tells us the typical error in using the regression to predict Y from X.

Connection Between Correlation and R Squared

The correlation r between two variables X and Y is a measure of the direction and strength of the linear association between X and Y. The correlation ranges between -1 and 1, with a correlation near -1 indicating a strong negative linear association, a correlation near 0 indicating little association, and a correlation near 1 indicating a strong positive linear association. The R² from the regression of Y on X is the square of the correlation between X and Y.

More Information About Your Potential Date's Salary: Prediction Intervals

• From the regression model, you predict that your potential date's salary is $54,100, and the typical error you expect to make in your prediction is $4,611.
• Suppose you want an interval that will contain your date's salary most of the time (say, 95% of the time).
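To make the formula concrete, here is a small Python sketch on hypothetical (x, y) data (purely illustrative, not the teacher data, whose individual values are not listed in the notes) that computes R² from the two sums of squares and confirms it equals the squared correlation:

```python
import math

# Hypothetical data for illustration only
x = [1, 2, 3, 4, 5]
y = [2.0, 2.9, 4.2, 4.9, 6.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products about the means
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares slope and intercept
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# R^2 = (Total SS - Residual SS) / Total SS
total_ss = s_yy
resid_ss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = (total_ss - resid_ss) / total_ss

# R^2 equals the square of the correlation between X and Y
r = s_xy / math.sqrt(s_xx * s_yy)
assert abs(r_squared - r ** 2) < 1e-9
```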
• We can find such a prediction interval by using the fact that under the simple linear regression model, the distribution of Y|X is normal; here the subpopulation of teachers with 8 years of experience has a normal distribution with estimated mean $54,100 and estimated standard deviation $4,611.

Prediction Interval

• A 95% prediction interval has the property that if we repeatedly take samples y1, ..., yn from a population following the simple regression model, with x1, ..., xn fixed at their current values, and then sample a new y_p at X = x_p, the prediction interval will contain y_p 95% of the time.
• Best prediction of y_p: Ŷ_p = Ê(Y | X = X_p) = b0 + b1·X_p
• s_p = RMSE · sqrt( 1 + 1/n + (X_p − X̄)² / ((n−1)·s_X²) ), where s_X² = Σ_{i=1}^{n} (X_i − X̄)² / (n−1).
• 95% prediction interval: Ŷ_p ± t_{.025, n−2} · s_p
• Comment: for large n, the 95% prediction interval is approximately Ŷ_p ± 2·RMSE.

Prediction Interval for Your Date's Salary

• Suppose your date has 8 years of experience.
  Ŷ_p = 40612.14 + 1686.07*8 = 54100.7
  s_p = RMSE · sqrt( 1 + 1/n + (X_p − X̄)² / ((n−1)·s_X²) )
      = 4610.93 · sqrt( 1 + 1/11 + (8 − 6.09)² / (10 · 2.844²) ) = 4914.4
  95% prediction interval: Ŷ_p ± t_{.025, n−2}·s_p = 54100.7 ± 2.262·4914.4 ≈ (42984, 65217)

Your date's salary will be in this range most of the time. We obtain X̄ and s_X² from Analyze, Distribution on the X variable.

[Figure: Distribution of Years of Experience for the 11 teachers, from 0 to 12.5 years]

Moments
  Mean             6.0909091
  Std Dev          2.8444523
  Std Err Mean     0.8576346
  upper 95% Mean   8.0018382
  lower 95% Mean   4.17998
  N                11

Prediction Intervals in JMP

• After using Fit Line, click the red triangle next to Linear Fit and click Confid Curves Indiv.

[Figure: Bivariate Fit of Salary By Years of Experience with individual 95% confidence (prediction) curves added]

• Use the crosshair tool (under Tools) to find the exact prediction interval for a particular x value.

Approximate Prediction Intervals

The 95% prediction interval is approximately Ŷ_p ± 2·RMSE.
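The interval can be reproduced from the summary quantities in the output. This is a sketch assuming the standard prediction-interval formula; the t critical value (2.262 for 9 degrees of freedom) is hard-coded from a t table rather than computed:

```python
import math

# Quantities from the notes' teacher-salary regression
n = 11
rmse = 4610.93                  # root mean square error
b0, b1 = 40612.135, 1686.0674   # least squares intercept and slope
x_bar, s_x = 6.0909, 2.8445     # mean and SD of Years of Experience
x_p = 8                         # your date's years of experience

# Best prediction of the date's salary
y_hat = b0 + b1 * x_p

# Standard error for predicting an individual response at x_p
s_p = rmse * math.sqrt(1 + 1/n + (x_p - x_bar) ** 2 / ((n - 1) * s_x ** 2))

# t critical value with n - 2 = 9 degrees of freedom (from a t table)
t_crit = 2.262

lower, upper = y_hat - t_crit * s_p, y_hat + t_crit * s_p
```

Note that s_p is only slightly larger than the RMSE here because x_p = 8 is close to the mean of the observed x's; it grows as x_p moves away from X̄.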
Under the simple linear regression model:
• Approximately 68% of the Y_i's will be within one RMSE of their predicted value Ê(Y | X = X_i) = b0 + b1·X_i.
• Approximately 95% of the Y_i's will be within two RMSEs of their predicted value.
• Approximately 99% of the Y_i's will be within three RMSEs of their predicted value.

Forecasting Outside the Range of the Explanatory Variable (Extrapolation)

• When constructing estimates of E(Y | X_p) or predicting individual values of Y based on x_p, caution must be used if x_p is outside the range of the observed x's. The data do not provide information about whether the simple linear regression model continues to hold outside the range of the observed x's.
• Prediction intervals only account for (1) variability in Y given X and (2) uncertainty in the estimates of the slope and intercept, given that the simple linear regression model is true. When x_p is outside the range of the observed x's, the prediction interval might not be accurate.

Olympic Long Jump: Length of Gold Medal Jump (Y) vs. Year (X)

[Figure: Bivariate Fit of Length By Year -- winning lengths (about 21-30 feet) plotted against year, with fitted line]

Linear Fit
  Length = -72.49157 + 0.0504846 * Year

Predictions from the Long Jump Simple Linear Regression Model

• Predicted Olympic gold medal winning long jumps:
  – 2012 (London): -72.49157 + 0.0504846*2012 = 29.08 feet
  – 2032: -72.49157 + 0.0504846*2032 = 30.09 feet
  – 3000: -72.49157 + 0.0504846*3000 = 78.96 feet
• 95% prediction interval for year 3000:
  Ŷ_p ± t_{.025, n−2}·s_p = 78.96 ± 2.064 · RMSE · sqrt( 1 + 1/26 + (3000 − X̄)² / ((n−1)·s_X²) ) = (67.14, 90.78)
• The prediction interval is not reasonable! Predicting the winning distance for the year 3000 is an extrapolation.

Association vs. Causality

• A high R² in a simple linear regression of Y on X means that X has a strong linear relationship with Y; in other words, changes in X are strongly associated with changes in the mean of Y.
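The extrapolation can be illustrated directly from the fitted line; the intercept and slope are taken from the Linear Fit output, and the fitted line is applied mechanically to any year, whether or not the model still holds there:

```python
# Fitted line from the notes: Length = -72.49157 + 0.0504846 * Year
b0, b1 = -72.49157, 0.0504846

def predicted_length(year):
    """Predicted gold-medal long jump (feet). Only trustworthy for years
    inside the observed range of the data; the formula happily extrapolates."""
    return b0 + b1 * year

near = predicted_length(2012)     # ~29.08 ft, close to the observed years
far_out = predicted_length(3000)  # ~78.96 ft, an absurd extrapolation
```

The formula is equally willing to produce both numbers; only knowledge of the observed x-range tells you the second one is meaningless.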
It does not imply that changes in X cause changes in Y.

• Alternative explanations for a high R²:
  – The reverse is true: Y causes X.
  – There may be a lurking (confounding) variable related to both X and Y that is the common cause of both.

[Figure: Bivariate Fit of Salary of Presbyterian Ministers in MA By Price of Rum -- salaries rise with the price of rum, with the years 1886, 1926, 1954, 1982, and 1998 labeled along the upward trend]

Are the Presbyterian ministers benefiting from the rum trade or supporting it? Neither: the lurking variable of inflation over time is the common cause of increases in Presbyterian ministers' salaries and the price of rum.

Review

• R squared measures how much better the regression model predicts Y than just using the mean of Y.
• 95% prediction interval: an interval that contains a new observation's Y, given the new observation's X, with 95% probability.
• Approximately 95% of the observations Y_i are within 2 RMSEs of their predicted value Ê(Y | X = X_i), given their X.
• Cautions in interpreting regression:
  – Prediction intervals for X values outside the range of the observed X values may not be accurate.
  – Regression measures the association between X and the mean of Y and does not necessarily measure the causal effect of X on Y.
• Next class: Sections 3.5.2, 3.6