Class 11: Thurs., Oct. 14
• Finish transformations
• Example Regression Analysis
• Next Tuesday: Review for Midterm (I will take questions and go over the practice midterm if there are no questions)
• Next Thursday: Midterm
• HW5 due Tuesday.
• I will e-mail review notes and a practice midterm to you tomorrow.

Transformations in JMP
1. Use Tukey's Bulging rule (see handout) to determine transformations which might help.
2. After Fit Y by X, click the red triangle next to Bivariate Fit and click Fit Special. Experiment with the transformations suggested by Tukey's Bulging rule.
3. Make residual plots of the residuals for the transformed model vs. the original X by clicking the red triangle next to "Transformed Fit to …" and clicking Plot Residuals. Choose transformations which make the residual plot have no pattern in the mean of the residuals vs. X.
4. Compare different transformations by looking for the transformation with the smallest root mean square error on the original y-scale. If using a transformation that involves transforming y, look at the root mean square error for the fit measured on the original scale.
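The comparison in step 4 can also be sketched outside JMP. The Python below is a minimal illustration under stated assumptions, not a reproduction of JMP's internals: the helper names `fit_line` and `rmse_original_scale` are hypothetical, and the data are synthetic (generated so that y is exactly linear in log x), not the lecture data. It fits f(y) = b0 + b1 g(x) by least squares, back-transforms the fitted values, and measures root mean square error on the original y-scale with divisor n - 2.

```python
import math

def fit_line(u, v):
    """Least-squares intercept and slope for v = b0 + b1 * u."""
    n = len(u)
    u_bar = sum(u) / n
    v_bar = sum(v) / n
    sxx = sum((ui - u_bar) ** 2 for ui in u)
    sxy = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v))
    b1 = sxy / sxx
    return v_bar - b1 * u_bar, b1

def rmse_original_scale(x, y, g=lambda t: t, f=lambda t: t, f_inv=lambda t: t):
    """Fit f(y) = b0 + b1 * g(x), back-transform fitted values with f_inv,
    and return the root mean square error on the original y-scale."""
    gx = [g(xi) for xi in x]
    fy = [f(yi) for yi in y]
    b0, b1 = fit_line(gx, fy)
    fitted = [f_inv(b0 + b1 * gi) for gi in gx]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return math.sqrt(sse / (len(x) - 2))

# Synthetic, noise-free data (NOT the lecture data): y is exactly linear in log x.
x = [100, 500, 1000, 5000, 10000, 20000]
y = [40 + 4 * math.log(xi) for xi in x]

candidates = {
    "linear": rmse_original_scale(x, y),
    "log x": rmse_original_scale(x, y, g=math.log),
    "sqrt x": rmse_original_scale(x, y, g=math.sqrt),
    "y squared": rmse_original_scale(x, y, f=lambda t: t ** 2, f_inv=math.sqrt),
}
best = min(candidates, key=candidates.get)

# Print the candidates from smallest to largest RMSE.
for name, rmse in sorted(candidates.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} RMSE on original scale = {rmse:.4f}")
print("smallest RMSE:", best)
```

Because the synthetic y was built from log x, the log x fit should come out with essentially zero error here; with real data the ranking has to be read off the fits, as in step 4.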
Bivariate Fit of Life Expectancy By Per Capita GDP
[Figure: scatterplot of Life Expectancy (40 to 80) vs. Per Capita GDP (0 to 30,000) with four fitted curves: Linear Fit, Transformed Fit to Log, Transformed Fit to Sqrt, Transformed Fit Square]

Linear Fit: Life Expectancy = 56.176479 + 0.0010699 Per Capita GDP
Transformed Fit to Log: Life Expectancy = -7.97718 + 8.729051 Log(Per Capita GDP)
Transformed Fit to Sqrt: Life Expectancy = 47.925383 + 0.2187935 Sqrt(Per Capita GDP)
Transformed Fit Square: Square(Life Expectancy) = 3232.1292 + 0.1374831 Per Capita GDP

Summary of Fit
                              Linear Fit   Transformed Fit to Log   Transformed Fit to Sqrt
RSquare                       0.515026     0.749874                 0.636551
RSquare Adj                   0.510734     0.74766                  0.633335
Root Mean Square Error        8.353485     5.999128                 7.231524
Mean of Response              63.86957     63.86957                 63.86957
Observations (or Sum Wgts)    115          115                      115

Transformed Fit Square: Fit Measured on Original Scale
Sum of Squared Error      7597.7156
Root Mean Square Error    8.1997818
RSquare                   0.5327083
Sum of Residuals          -70.29942

By looking at the root mean square error on the original y-scale, we see that all of the transformations improve upon the untransformed model and that the transformation to log x is by far the best.

[Figure: residuals (-25 to 15) vs. Per Capita GDP (0 to 30,000), four panels: Linear Fit, Transformation to √X, Transformation to Log X, Transformation to Y²]

The transformation to Log X appears to have mostly removed a trend in the mean of the residuals. This means that $E(Y \mid X) \approx \beta_0 + \beta_1 \log X$. There is still a problem of nonconstant variance.

How do we use the transformation?
• Life Expectancy = -7.97718 + 8.729051 log(Per Capita GDP)

Parameter Estimates
Term                  Estimate    Std Error   t Ratio   Prob>|t|
Intercept             -7.97718    3.943378    -2.02     0.0454
log Per Capita GDP    8.729051    0.474257    18.41     <.0001

• Testing for association between Y and X: If the simple linear regression model holds for f(Y) and g(X), then Y and X are associated if and only if the slope in the regression of f(Y) on g(X) does not equal zero. The p-value for the test that the slope is zero is <.0001: strong evidence that per capita GDP and life expectancy are associated.
• Prediction and mean response: What would you predict the life expectancy to be for a country with a per capita GDP of $20,000?

$\hat{E}(Y \mid X = 20{,}000) = \hat{E}(Y \mid \log X = \log 20{,}000) = \hat{E}(Y \mid \log X = 9.9035) = -7.9772 + 8.7291 \times 9.9035 = 78.47$

More Examples of finding E(Y|X) using a transformation
• Suppose the simple linear regression model holds for $E(Y \mid \sqrt{X})$. Then

$\hat{E}(Y \mid X = 20{,}000) = \hat{E}(Y \mid \sqrt{X} = \sqrt{20{,}000}) = \hat{E}(Y \mid \sqrt{X} = 141.42) = 47.93 + 0.219 \times 141.42 = 78.90$

• Suppose the simple linear regression model holds for $E(Y^2 \mid X)$. Then

$\hat{E}(Y \mid X = 20{,}000) = \sqrt{\hat{E}(Y^2 \mid X = 20{,}000)} = \sqrt{3232.1292 + 0.1374831 \times 20{,}000} = \sqrt{5981.79} = 77.34$

CIs for Mean Response and Prediction Intervals
Bivariate Fit of Life Expectancy By Per Capita GDP
[Figure: scatterplot of Life Expectancy (40 to 100) vs. Per Capita GDP (0 to 30,000) with the Transformed Fit to Log curve, confidence bands for the mean response, and prediction bands]
• 95% Confidence Interval for Mean Response E(Y|X=20,000) = (76.89, 80.50)
• 95% Prediction Interval for Y|X=20,000 = (66.42, 91.69)

Note: To expand the Y-axis or X-axis, right-click on that axis, click Axis Settings and change the minimum/maximum. In order to fully see the prediction intervals, I needed to expand the Y-axis.
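The three predictions at a per capita GDP of $20,000 can be reproduced with a few lines of Python. This is a minimal sketch, not JMP output: the function names are hypothetical, and the default coefficients are copied from the fitted equations reported for the life expectancy data. Because it uses the unrounded coefficients, the square-root prediction comes out near 78.87, slightly lower than the hand calculation with rounded coefficients.

```python
import math

def predict_from_log_x(x, b0=-7.97718, b1=8.729051):
    """Fit of Y on log X: plug log(x) into the fitted line."""
    return b0 + b1 * math.log(x)

def predict_from_sqrt_x(x, b0=47.925383, b1=0.2187935):
    """Fit of Y on sqrt X: plug sqrt(x) into the fitted line."""
    return b0 + b1 * math.sqrt(x)

def predict_from_square_y(x, b0=3232.1292, b1=0.1374831):
    """Fit of Y^2 on X: predict Y^2, then take the square root to return to the y-scale."""
    return math.sqrt(b0 + b1 * x)

gdp = 20_000
print(f"log x fit:     {predict_from_log_x(gdp):.2f}")
print(f"sqrt x fit:    {predict_from_sqrt_x(gdp):.2f}")
print(f"y squared fit: {predict_from_square_y(gdp):.2f}")
```

In each case the transformation of X is applied before plugging into the fitted line, and any transformation of Y is undone afterwards.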
Another Example of Transformations: Y = count of tree seeds, X = seed weight (mg)

Bivariate Fit of Seed Count By Seed weight (mg)
[Figure: scatterplot of Seed Count (-5000 to 30,000) vs. Seed weight (mg) (-1000 to 5000), shown first without fits and then with three fitted curves: Linear Fit, Transformed Fit to Log, Transformed Fit Log to Log]

Linear Fit: Seed Count = 6751.7179 - 2.1076776 Seed weight (mg)
Transformed Fit to Log: Seed Count = 12174.621 - 1672.3962 Log(Seed weight (mg))
Transformed Fit Log to Log: Log(Seed Count) = 9.758665 - 0.5670124 Log(Seed weight (mg))

Summary of Fit
                              Linear Fit   Transformed Fit to Log
RSquare                       0.220603     0.566422
RSquare Adj                   0.174756     0.540918
Root Mean Square Error        6199.931     4624.247
Mean of Response              4398.474     4398.474
Observations (or Sum Wgts)    19           19

Transformed Fit Log to Log: Fit Measured on Original Scale
Sum of Squared Error      161960739
Root Mean Square Error    3086.6004
RSquare                   0.8068273
Sum of Residuals          3142.2066

By looking at the root mean square error on the original y-scale, we see that both of the transformations improve upon the untransformed model and that the transformation to log y and log x is by far the best.

Prediction using the log y/log x transformation
• What is the predicted seed count for a tree whose seeds weigh 50 mg?
• Math trick: exp{log(y)} = y (remember that by log we always mean the natural log, ln), e.g., $e^{\log 10} = 10$.

$\hat{E}(Y \mid X = 50) = \exp\{\hat{E}(\log Y \mid X = 50)\} = \exp\{\hat{E}(\log Y \mid \log X = \log 50)\} = \exp\{\hat{E}(\log Y \mid \log X = 3.912)\} = \exp\{9.7587 - 0.5670 \times 3.912\} = \exp\{7.5406\} = 1882.96$

Assumptions for linear regression and their importance to inferences
Inference                                                                                        Assumptions that are important
Point prediction, point estimation                                                               Linearity (specification of mean of Y|X is correct), independence
Confidence interval for slope, hypothesis test for slope, confidence interval for mean response  Linearity, constant variance, independence, normality (only if n<30)
Prediction interval                                                                              Linearity, constant variance, independence, normality

Transformations to Remedy Nonconstant Variance and Nonnormality
Nonconstant Variance
• When the variance of Y|X increases with X, try transforming Y to log Y or Y to $\sqrt{Y}$.
• When the variance of Y|X decreases with X, try transforming Y to 1/Y or Y to $Y^2$.
Nonnormality
• When the distribution of the residuals is skewed to the right, try transforming Y to log Y.
• When the distribution of the residuals is skewed to the left, try transforming Y to $Y^2$.

Steps in Regression Analysis
1. Define the question of interest. Review the design of the study to see if it can answer the question of interest. Correct errors in the data.
2. Explore the data using a scatterplot.
3. Fit an initial regression model (possibly using a transformation). Check the assumptions of the regression model.
4. Investigate influential points.
5. Infer answers to the questions of interest using appropriate tools (e.g., confidence intervals, hypothesis tests, prediction intervals).
6. Communicate the results to the intended audience.
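The log y/log x prediction for a seed weight of 50 mg, worked by hand earlier in this class, can be checked numerically. The sketch below is illustrative only: the function name `predict_seed_count` is hypothetical, and the default coefficients are copied from the reported fit Log(Seed Count) = 9.758665 - 0.5670124 Log(Seed weight (mg)). With unrounded coefficients and an unrounded log 50, the result differs from the hand-calculated 1882.96 by a fraction of a seed.

```python
import math

def predict_seed_count(seed_weight_mg, b0=9.758665, b1=-0.5670124):
    """Predict on the log scale, then undo the log with exp (natural log throughout)."""
    log_prediction = b0 + b1 * math.log(seed_weight_mg)
    return math.exp(log_prediction)

weight = 50  # seed weight in mg
print(f"log-scale prediction: {9.758665 - 0.5670124 * math.log(weight):.4f}")
print(f"predicted seed count: {predict_seed_count(weight):.2f}")
```

The key step is the last one: the regression predicts log Y, so the prediction must be exponentiated to return to the original seed count scale.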