Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Prediction concerning the response Y College entrance test score Simple linear regression model 22 Y E Y 0 1 x 18 14 10 Yi 0 1 x i 6 1 2 3 High school gpa 4 5 Simple linear regression model Three different research questions • What is the mean response, E(Yh), for a given value, xh, of the predictor variable? • What would one predict a new observation , Yh(new), to be for a given value, xh, of the predictor variable? • What would one predict the mean of m new observations, Y h(new ) , to be for a given value, xh , of the predictor variable? Example: Skin cancer mortality and latitude • What is the expected (mean) mortality rate for all locations at 40o N latitude? • What is the predicted mortality rate for 1 new randomly selected location at 40o N? • What is the predicted mortality rate for 10 new randomly selected locations at 40o N? Example: Skin cancer mortality and latitude Regression Plot Mortality = 389.189 - 5.97764 Latitude S = 19.1150 R-Sq = 68.0 % R-Sq(adj) = 67.3 % Mortality 200 150 100 30 40 Latitude 50 “Point estimators” Yˆh b0 b1 xh is the best point estimator in each case. That is, it is: • the best guess of the mean response at xh • the best guess of a new observation at xh • the best guess of a mean of m new observations at xh But, as always, to be confident in the answer to our research question, we should put an interval around our best guess. It is dangerous to “extrapolate” beyond scope of model. Regression Plot colonies = 16.0667 + 1.61576 conc S = 2.67546 R-Sq = 66.8 % R-Sq(adj) = 63.5 % 30 colonies 25 20 15 0 1 2 3 conc 4 5 6 It is dangerous to “extrapolate” beyond scope of model. Regression Plot colonies = 15.0205 + 3.22113 conc - 0.276956 conc**2 S = 2.74819 R-Sq = 69.6 % R-Sq(adj) = 64.5 % colonies 30 20 10 0 5 10 conc Confidence interval for the population mean response E(Yh) College entrance test score Again, what are we estimating? 22 Y E Y 0 1 x 18 14 10 Yi 0 1 x i 6 1 2 3 High school gpa 4 5 (1-α)100% t-interval for mean response E(Yh) Formula in words: Sample estimate ± (t-multiplier × standard error) Formula in notation: 2 1 xh x yˆ h t 1 ,n 2 MSE n x x 2 2 i Implications on precision • The greater the spread in the xi values, the narrower the confidence interval, the more precise the prediction of E(Yh). • Given the same set of xi values, the further xh is from the (sample) mean of the xi, the wider the confidence interval, the less precise the prediction of E(Yh). Predicted Values for New Observations New Fit SE Fit 95.0% CI 95.0% PI 1 150.08 2.75 (144.6,155.6) (111.2,188.93) 2 221.82 7.42 (206.9,236.8) (180.6,263.07)X X denotes a row with X values away from the center Values of Predictors for New Observations New Obs Latitude 1 40.0 2 28.0 Mean of Lat = 39.533 Comments on assumptions • xh is a value within scope of model, but it is not necessary that it is one of the x values in the data set. • The confidence interval formula for E(Yh) works okay even if the error terms are only approximately normally distributed. • If you have a large sample, the error terms can even deviate substantially from normality without greatly affecting appropriateness of the confidence interval. Prediction interval for a new response Yh(new) College entrance test score Again, what are we predicting? 22 Y E Y 0 1 x 18 14 10 Yi 0 1 x i 6 1 2 3 High school gpa 4 5 (1-α)100% prediction interval for new response Yh(new) Formula in words: Sample prediction ± (t-multiplier × standard error) Formula in notation: 2 1 xh x yˆ h t 1 ,n 2 MSE 1 n x x 2 2 i Prediction of Yh(new) if mean E(Y) is known Assume 2 25 so 5 Prediction of Yh(new) if mean E(Y) is known 0.08 Normal curve 0.07 0.06 0.05 0.04 0.997 0.03 0.02 0.01 0.00 47 52 57 62 67 Number of hours 72 77 Prediction of Yh(new) if mean E(Y) is not known Summary of prediction issues • We cannot be certain of the mean of the distribution of Y. • Prediction limits for Yh(new) must take into account: – variation in the possible mean of the distribution of Y – variation in the responses Y within the probability distribution Variation of the prediction The variation in the prediction of a new response depends on two components: 1. the variation due to estimating the mean E(Yh) with ŷ h 2. the variation in Y within the probability distribution (Yˆh ) 2 2 which is estimated by: 2 2 1 1 xh x xh x MSE 1 n MSE MSE n 2 2 n n x x x x i i i 1 i 1 (1-α)100% prediction interval for new response Yh(new) Formula in words: Sample prediction ± (t-multiplier × standard error) Formula in notation: 2 1 xh x yˆ h t 1 ,n 2 MSE 1 n x x 2 2 i Confidence intervals and prediction intervals for response in Minitab • Stat >> Regression >> Regression … • Specify response and predictor(s). • Select Options… – In “Prediction intervals for new observations” box, specify either the X value or a column name containing multiple X values. – Specify confidence level (default is 95%). • Click on OK. Click on OK. • Results appear in session window. S = 19.12 R-Sq = 68.0% R-Sq(adj)= 67.3% Predicted Values for New Observations New Fit SE Fit 95.0% CI 95.0% PI 1 150.08 2.75 (144.6,155.6) (111.2,188.93) 2 221.82 7.42 (206.9,236.8) (180.6,263.07)X X denotes a row with X values away from the center Values of Predictors for New Observations New Obs Latitude 1 40.0 2 28.0 Mean of Lat = 39.533 Comments on assumptions • xh is a value within scope of model, but it is not necessary that it is one of the x values in the data set. • The formula for the prediction interval depends strongly on the assumption that the error terms are normally distributed. A plot of the confidence interval and prediction interval in Minitab • Stat >> Regression >> Fitted line plot … • Specify predictor and response. • Under Options … – Select Display confidence bands. – Select Display prediction bands. – Specify desired confidence level (95% default) • Select OK. Select OK. Regression Plot Mortality = 389.189 - 5.97764 Latitude S = 19.1150 R-Sq = 68.0 % R-Sq(adj) = 67.3 % Mortality 250 150 Regression 95% CI 95% PI 50 30 40 Latitude 50