Prediction concerning Y variable

Three different research questions
• What is the mean response, $E(Y_h)$, for a given level, $X_h$, of the predictor variable?
• What would one predict a new observation, $Y_{h(\text{new})}$, to be for a given level, $X_h$, of the predictor variable?
• What would one predict the mean of $m$ new observations, $\bar{Y}_{h(\text{new})}$, to be for a given level, $X_h$, of the predictor variable?

Example: Mortality and Latitude
• What is the expected (mean) mortality rate for locations at 40° N latitude?
• What is the predicted mortality rate for a new randomly selected location at 40° N?
• What is the predicted mean mortality rate for 10 new randomly selected locations at 40° N?

[Figure: regression plot of Mortality against Latitude. Fitted line: Mortality = 389.189 - 5.97764 Latitude; S = 19.1150, R-Sq = 68.0%, R-Sq(adj) = 67.3%.]

Point estimators
$\hat{Y}_h = b_0 + b_1 X_h$ is the best point estimator in each case. That is, it is:
• the best guess of the mean response at $X_h$
• the best guess of a new observation at $X_h$
• the best guess of a mean of $m$ new observations at $X_h$
But, as always, to be confident in the answer to our research question, we should put an interval around our best guess.

Interval estimation of mean response $E(Y_h)$

Sampling distribution of $\hat{Y}_h$
Provided the error terms $\varepsilon_i$ are normally distributed, $\hat{Y}_h$ is normally distributed with mean $E(Y_h)$ and variance

$$\sigma^2(\hat{Y}_h) = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]$$

Implications on precision
• The greater the spread in the $X_i$ values, the smaller the variance of $\hat{Y}_h$, and the more precise the estimate of $E(Y_h)$.
• Given the same set of $X_i$ values, the further $X_h$ is from the (sample) mean of the $X_i$, the greater the variance of $\hat{Y}_h$, and the less precise the estimate of $E(Y_h)$.
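The two precision implications above can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the function name `variance_factor` and the two small sets of predictor values are hypothetical, chosen only to share the same mean while differing in spread. The function returns the bracketed factor that multiplies $\sigma^2$ in the variance of $\hat{Y}_h$.

```python
def variance_factor(xs, xh):
    """Return 1/n + (xh - xbar)^2 / sum((xi - xbar)^2).

    Multiplying this factor by sigma^2 gives the variance of Y-hat-h.
    """
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return 1.0 / n + (xh - xbar) ** 2 / sxx


# Hypothetical predictor values (NOT the mortality data):
# both sets have mean 40, but the second is more spread out.
narrow = [38, 39, 40, 41, 42]
wide = [30, 35, 40, 45, 50]

# Implication 1: at the same Xh, a wider spread of X values gives a smaller factor.
print("narrow vs wide at Xh=44:", variance_factor(narrow, 44), variance_factor(wide, 44))

# Implication 2: for fixed X values, the factor grows as Xh moves away from the mean.
for xh in (40, 44, 48):
    print("wide, Xh =", xh, "factor =", variance_factor(wide, xh))
```

At $X_h = \bar{X}$ the second term vanishes and the factor reduces to $1/n$, which is the smallest value it can take for a given sample.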
Estimate of the variance
Estimate the variance

$$\sigma^2(\hat{Y}_h) = \sigma^2 \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]$$

with

$$s^2(\hat{Y}_h) = \text{MSE} \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]$$

Then the estimated standard deviation is $s(\hat{Y}_h) = \sqrt{s^2(\hat{Y}_h)}$.

Confidence interval for $E(Y_h)$
Sample estimate ± margin of error:

$$\hat{Y}_h \pm t_{1-\alpha/2,\, n-2} \; s(\hat{Y}_h)$$

The estimation in Minitab
• Stat >> Regression >> Regression …
• Specify response and predictor(s).
• Select Options… In the “Prediction intervals for new observations” box, specify either the X value or a column name containing multiple X values. Specify the confidence level.
• Click on OK in each dialog box.
• Results appear in the session window.

    Predicted Values for New Observations
    New Obs     Fit   SE Fit        95.0% CI           95.0% PI
          1  150.08     2.75   (144.6, 155.6)   (111.2, 188.93)
          2  221.82     7.42   (206.9, 236.8)   (180.6, 263.07) X
    X denotes a row with X values away from the center

    Values of Predictors for New Observations
    New Obs  Latitude
          1      40.0
          2      28.0

Minitab output
“Fit” is $\hat{Y}_h = b_0 + b_1 X_h = 389.19 - 5.9776(40) = 150.08$

“SE Fit” is

$$s(\hat{Y}_h) = \sqrt{\text{MSE} \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]} = 2.75$$

$t_{0.975,\, 47} = 2.0117$

Therefore, the “95% CI” for $E(Y_h)$ is

$$150.08 \pm (2.0117)(2.75) = 150.08 \pm 5.53 = (144.55,\ 155.61)$$

Difference in precision of estimates
• The mean of the 49 latitudes in the data set is 39.5° N.
• SE Fit for $X_h = 40$ is 2.75.
• SE Fit for $X_h = 28$ is 7.42 (larger, as expected).
• The closer $X_h$ is to the sample mean, the narrower the confidence interval and the more precise the estimate of $E(Y_h)$.

Comments on assumptions
• $X_h$ must be a value within the scope of the model, that is, within the range of X values in the data set, but it need not be one of the observed X values.
• It is OK to use the formula for the confidence interval for $E(Y_h)$ even if the error terms are only approximately normally distributed.
• If you have a large sample, the error terms can even deviate substantially from normality without greatly affecting the appropriateness of the confidence interval.

Prediction of a New Observation

Restatement of problem
• We previously estimated the mean response $E(Y_h)$.
That is, we estimated the mean of the distribution of Y at a given $X_h$.
• Now, we want to predict a new response $Y_{h(\text{new})}$. That is, we predict an individual outcome Y at a given $X_h$.
• Most outcomes Y deviate from the mean response $E(Y_h)$. We must take this into account when we predict $Y_{h(\text{new})}$.

How to obtain a prediction interval if distribution of Y is known
• If you know the distribution of Y, you know
  – its shape (say, it’s normal)
  – its mean (say, it’s μ (“mu”))
  – its standard deviation (say, it’s σ (“sigma”))
• BASIC IDEA: Using the distribution, determine a range in which most of the Y observations will fall. Claim that the next observation will fall there, too.

Example: High school GPA (X) and College GPA (Y)
• Distribution of college GPA (Y) depends on high school GPA (X) through intercept and slope parameters.
• Suppose:
  – Y is normally distributed
  – Mean is E(Y) = 0.10 + 0.95 X
  – Standard deviation σ (“sigma”) = 0.12
• For students with X = 3.5 high school GPA:
  – E(Y) = 0.10 + 0.95(3.5) = 3.425

Example: 99.7% prediction interval for $Y_{h(\text{new})}$
• The probability that a randomly selected student with a high school GPA of 3.5 will have a college GPA between
  – 3.425 - 3(0.12) = 3.065 and
  – 3.425 + 3(0.12) = 3.785
  is 0.997.

But we have a problem …
• The last calculation was possible because we knew $\beta_0$, $\beta_1$, and $\sigma$. Hence, we knew the mean and variance, $E(Y)$ and $\sigma^2$, respectively, of the distribution of Y.
• We could consider estimating $E(Y)$ and $\sigma^2$ with $\hat{Y}_h$ and MSE, respectively, and applying the same method as before.
• But it’s not quite right. Here’s why.

So …
• We cannot be certain of the location (mean) of the distribution of Y.
• Prediction limits for $Y_{h(\text{new})}$ must take into account:
  – variation in the possible location (mean) of the distribution of Y
  – variation in Y within the probability distribution

Variation of the prediction
The variation in the prediction of a new response depends on two components: the variation due to estimating $E(Y_h)$ with $\hat{Y}_h$ and the variation in Y within the probability distribution.

$$\sigma^2(\text{pred}) = \sigma^2(\hat{Y}_h) + \sigma^2$$

which is estimated by:

$$s^2(\text{pred}) = \text{MSE} + \text{MSE} \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] = \text{MSE} \left[ 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]$$

Prediction interval for $Y_{h(\text{new})}$
Provided the error terms $\varepsilon_i$ are normally distributed:

$$\hat{Y}_h \pm t_{1-\alpha/2,\, n-2} \; s(\text{pred})$$

The prediction in Minitab
• Stat >> Regression >> Regression …
• Specify response and predictor(s).
• Select Options… In the “Prediction intervals for new observations” box, specify either the X value or a column name containing multiple X values. Specify the confidence level.
• Click on OK in each dialog box.
• Results appear in the session window.

    S = 19.12   R-Sq = 68.0%   R-Sq(adj) = 67.3%

    Predicted Values for New Observations
    New Obs     Fit   SE Fit        95.0% CI           95.0% PI
          1  150.08     2.75   (144.6, 155.6)   (111.2, 188.93)
          2  221.82     7.42   (206.9, 236.8)   (180.6, 263.07) X
    X denotes a row with X values away from the center

    Values of Predictors for New Observations
    New Obs  Latitude
          1      40.0
          2      28.0

Minitab output
“Fit” is $\hat{Y}_h = b_0 + b_1 X_h = 389.19 - 5.9776(40) = 150.08$

$$s^2(\text{pred}) = \text{MSE} + \text{MSE} \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] = 19.12^2 + 2.75^2 = 373.1369$$

$$s(\text{pred}) = \sqrt{373.1369} = 19.32$$

$t_{0.975,\, 47} = 2.0117$

Therefore, the “95% PI” for $Y_{h(\text{new})}$ is

$$150.08 \pm (2.0117)(19.32) = 150.08 \pm 38.86 = (111.2,\ 188.93)$$

As always, some comments…
• In general, prediction intervals are wider than confidence intervals.
• Prediction intervals are (somewhat) wider the further $X_h$ is from the mean of the X values.
• The formula for the prediction interval depends strongly on the assumption that the error terms are normally distributed.

Remember the distinction …
• A confidence interval concerns the estimation of an unknown parameter.
It is an interval that is intended to cover the value of the unknown parameter.
• A prediction interval, on the other hand, is a statement about the value to be taken by a random variable, here, the new observation $Y_{h(\text{new})}$.

Getting a plot of the CI and PI in Minitab
• Stat >> Regression >> Fitted line plot …
• Specify predictor and response.
• Under Options… select Display confidence bands. Select Display prediction bands. Specify the desired confidence level.
• Select OK in each dialog box.

[Figure: fitted line plot of Mortality against Latitude showing the regression line, 95% CI bands, and 95% PI bands. Fitted line: Mortality = 389.189 - 5.97764 Latitude; S = 19.1150, R-Sq = 68.0%, R-Sq(adj) = 67.3%.]

In the worksheet below, SD_EY is $s(\hat{Y}_h)$ and SD_Pred is $s(\text{pred})$ at each $X_h$; sumsqX is $\sum (X_i - \bar{X})^2$.

    Row  Xh    Xbar   sumsqX   n      MSE    SD_EY  SD_Pred
      1  28  39.533  1020.54  49  365.383  7.42147  20.5052
      2  29  39.533  1020.54  49  365.383  6.86862  20.3116
      3  30  39.533  1020.54  49  365.383  6.32406  20.1340
      4  31  39.533  1020.54  49  365.383  5.79013  19.9727
      5  32  39.533  1020.54  49  365.383  5.27006  19.8282
      6  33  39.533  1020.54  49  365.383  4.76838  19.7008
      7  34  39.533  1020.54  49  365.383  4.29156  19.5908
      8  35  39.533  1020.54  49  365.383  3.84884  19.4986
      9  36  39.533  1020.54  49  365.383  3.45337  19.4244
     10  37  39.533  1020.54  49  365.383  3.12313  19.3685
     11  38  39.533  1020.54  49  365.383  2.88066  19.3308
     12  39  39.533  1020.54  49  365.383  2.74927  19.3117
     13  40  39.533  1020.54  49  365.383  2.74497  19.3111
     14  41  39.533  1020.54  49  365.383  2.86833  19.3290
     15  42  39.533  1020.54  49  365.383  3.10416  19.3654
     16  43  39.533  1020.54  49  365.383  3.42933  19.4202
     17  44  39.533  1020.54  49  365.383  3.82112  19.4932
     18  45  39.533  1020.54  49  365.383  4.26117  19.5842
     19  46  39.533  1020.54  49  365.383  4.73606  19.6930
     20  47  39.533  1020.54  49  365.383  5.23632  19.8192
     21  48  39.533  1020.54  49  365.383  5.75533  19.9626
     22  49  39.533  1020.54  49  365.383  6.28846  20.1228

Prediction of the mean of m new observations for given $X_h$

Same thinking as before … just a slight adjustment
• We cannot be certain of the location (mean) of the distribution of Y. The best estimate is $\hat{Y}_h$.
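• Prediction limits for the mean of $m$ new observations, $\bar{Y}_{h(\text{new})}$, must take into account:
  – variation in the possible location (mean) of the distribution of Y
  – variation in the mean of $m$ values of Y within the probability distribution

Variation of the prediction
The variation in the prediction of the mean of $m$ new responses depends on two components: the variation due to estimating $E(Y_h)$ with $\hat{Y}_h$ and the variation in the sample means within the probability distribution.

$$\sigma^2(\text{predmean}) = \sigma^2(\hat{Y}_h) + \frac{\sigma^2}{m}$$

which is estimated by:

$$s^2(\text{predmean}) = \frac{\text{MSE}}{m} + \text{MSE} \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] = \text{MSE} \left[ \frac{1}{m} + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right]$$

Prediction interval for $\bar{Y}_{h(\text{new})}$
Provided the error terms $\varepsilon_i$ are normally distributed:

$$\hat{Y}_h \pm t_{1-\alpha/2,\, n-2} \; s(\text{predmean})$$

Predict mean of m = 10 new responses
“Fit” is $\hat{Y}_h = b_0 + b_1 X_h = 389.19 - 5.9776(40) = 150.08$

$$s^2(\text{predmean}) = \frac{\text{MSE}}{m} + \text{MSE} \left[ \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] = \frac{19.12^2}{10} + 2.75^2 = 44.12$$

$$s(\text{predmean}) = \sqrt{44.12} = 6.64$$

$t_{0.975,\, 47} = 2.0117$

Therefore, the “95% PI” for $\bar{Y}_{h(\text{new})}$ is

$$150.08 \pm (2.0117)(6.64) = 150.08 \pm 13.358 = (136.7,\ 163.4)$$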
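The three intervals differ only in which variance is placed under the square root. The following Python sketch makes that explicit under stated assumptions: it uses the summary values quoted in the transcript (coefficients, n, Xbar, sumsqX, MSE) and hard-codes the quoted critical value $t_{0.975,\,47} = 2.0117$ rather than computing it. The function names `fit`, `var_fit`, and `interval` are illustrative and do not come from the slides or from Minitab.

```python
import math

# Summary values quoted in the transcript for the mortality/latitude fit.
B0, B1 = 389.189, -5.97764          # fitted intercept and slope
N, XBAR, SXX, MSE = 49, 39.533, 1020.54, 365.383
T_CRIT = 2.0117                     # t(0.975, 47), quoted in the transcript


def fit(xh):
    """Point estimate Y-hat-h = b0 + b1 * Xh."""
    return B0 + B1 * xh


def var_fit(xh):
    """Estimated variance of Y-hat-h: MSE * [1/n + (Xh - Xbar)^2 / Sxx]."""
    return MSE * (1.0 / N + (xh - XBAR) ** 2 / SXX)


def interval(xh, kind="mean", m=1):
    """Return (lower, upper) 95% limits at Xh.

    kind="mean"    : confidence interval for E(Yh)
    kind="new"     : prediction interval for one new observation
    kind="newmean" : prediction interval for the mean of m new observations
    """
    if kind == "mean":
        var = var_fit(xh)
    elif kind == "new":
        var = MSE + var_fit(xh)
    elif kind == "newmean":
        var = MSE / m + var_fit(xh)
    else:
        raise ValueError("kind must be 'mean', 'new', or 'newmean'")
    half_width = T_CRIT * math.sqrt(var)
    return fit(xh) - half_width, fit(xh) + half_width


ci = interval(40, "mean")
pi = interval(40, "new")
pim = interval(40, "newmean", m=10)
for label, (lo, hi) in (("CI", ci), ("PI", pi), ("PI, mean of 10", pim)):
    print(f"{label}: ({lo:.2f}, {hi:.2f})")
```

Because the sketch works from unrounded summary values, its limits can differ slightly in the second decimal place from the hand calculations above, which use the rounded figures 19.12 and 2.75. Setting `m=1` in the `"newmean"` case reproduces the single-observation prediction interval, and letting `m` grow shrinks the interval toward the confidence interval for $E(Y_h)$.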