Chapter 12.4: Estimation and Prediction for a New Value of x
Instructor: Dr. Arnab Maity

So far we have learned
• What the simple linear regression model is (y = β0 + β1 x + e)
• How to interpret the model parameters
• How to estimate the model parameters (least squares)
• Inference for the slope β1 (t-test and CI, ANOVA)

In this chapter, we will learn how to predict the value of y when we are given a value of x. We will also learn how to do inference about the predicted value.

Example: Corrosion of steel reinforcing bars is the most important durability problem for reinforced concrete structures. Carbonation of concrete results from a chemical reaction that lowers the pH value by enough to initiate corrosion of the rebar. Representative data on x = carbonation depth (mm) and y = strength (MPa) for a sample of core specimens taken from a particular building follow (read from a plot in the article “The Carbonation of Concrete Structures in the Tropical Environment of Singapore,” Magazine of Concrete Res., 1996: 293-300). Data are provided in Example 12.13.

Simple linear regression results:
Dependent Variable: strength
Independent Variable: carbonation_depth
strength = 27.182936 - 0.29756123 carbonation_depth
Sample size: 18
R (correlation coefficient) = -0.87497382
R-sq = 0.76557918
Estimate of error standard deviation: 2.864026

Parameter estimates:
Parameter   Estimate   Std. Err.   DF   95% L. Limit   95% U. Limit
Intercept   27.183     1.651       16   23.682         30.684
Slope       -0.298     0.041       16   -0.385         -0.210

Question: For a given value of the covariate x = 37, what is the expected value of y?

Recall that the “true” regression model is y = β0 + xβ1 + e, where E(e) = 0. Hence we have E(y) = β0 + xβ1. For a given value of the covariate x = 37, we use the formula above and see E(y|x = 37) = β0 + 37 · β1. As before, we replace the unknown parameters with their corresponding least squares estimates β̂0 and β̂1 and obtain Ê(y|x = 37) = β̂0 + 37 · β̂1.
In this example, using the unrounded estimates from the fitted equation, we have Ê(y|x = 37) = 27.182936 + 37(−0.29756123) = 16.173.

Question: We have just obtained a point estimate. This does not tell us how precisely the mean has been estimated. Can we construct a confidence interval?

Estimation of expected value of y for a given value of x = x∗

For a given x = x∗, we want to estimate µy|x∗ = E(y|x∗) = β0 + x∗β1. We can estimate µy|x∗ by µ̂y|x∗ = β̂0 + x∗β̂1.

• The estimator µ̂y|x∗ is a random variable, as it takes different values for different samples.
• The mean value of µ̂y|x∗ is E(µ̂y|x∗) = β0 + x∗β1 (= µy|x∗). Therefore it is an unbiased estimator of µy|x∗.
• The variance of µ̂y|x∗ is
  $$V(\hat{\mu}_{y|x^*}) = \sigma^2 \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}} \right],$$
  where $S_{XX} = \sum_i (x_i - \bar{x})^2$.
• The standard error of µ̂y|x∗ is
  $$SE(\hat{\mu}_{y|x^*}) = \sqrt{ \hat{\sigma}^2 \left[ \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}} \right] }.$$
• µ̂y|x∗ has a normal distribution.

Looking at the standard error of µ̂y|x∗, we see that
• A large error variance σ² results in less precise estimation (large SE) of µy|x∗.
• The estimator µ̂y|x∗ is more precise (smaller variance) when x∗ is near x̄ than when x∗ is far from x̄.

Inferences concerning µy|x∗

• The variable T = (µ̂y|x∗ − µy|x∗) / SE(µ̂y|x∗) has a t distribution with n − 2 degrees of freedom.
• A 100(1 − α)% confidence interval for µy|x∗ can be constructed as µ̂y|x∗ ± tα/2,n−2 SE(µ̂y|x∗).

In practice, we can perform such estimation and construct such confidence intervals for a grid of points for x. As a result we obtain a “point-wise” confidence band for the entire regression line. For the corrosion study data, we see the results below.

Predicted values:
X value   Pred. Y      s.e.(Pred. y)   95% C.I. for mean         95% P.I. for new
37        16.173171    0.67524719      (14.741711, 17.604631)    (9.9352421, 22.411099)

From the confidence band (green lines around the regression line), we see that the interval is narrower close to the center of the x values (i.e., closer to x̄) than near the boundaries.
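The confidence interval for the mean at x∗ = 37 can be checked numerically. The following is a minimal sketch, not StatCrunch's own procedure. It assumes the more precise slope standard error (0.041164172) and the sample mean of carbonation depth (36.611111) reported in the StatCrunch output of Exercise 2 of these notes, recovers SXX from the standard relation SE(β̂1) = σ̂/√SXX, and hard-codes the critical value t0.025,16 ≈ 2.119905. The function name `mean_response_ci` is hypothetical.

```python
import math

# Values copied from the StatCrunch output for the corrosion data.
n = 18
b0, b1 = 27.182936, -0.29756123   # least squares estimates
sigma_hat = 2.864026              # estimate of error standard deviation
se_b1 = 0.041164172               # std. err. of slope (from Exercise 2 output)
x_bar = 36.611111                 # mean carbonation depth (from Exercise 2 output)

# Assumption: SE(b1) = sigma_hat / sqrt(SXX), so SXX can be recovered
# without the raw data.
sxx = (sigma_hat / se_b1) ** 2

# Assumption: critical value t_{0.025, 16}, hard-coded instead of computed.
T_CRIT = 2.119905


def mean_response_ci(x_star):
    """Return (estimate, standard error, lower, upper) for the mean of y at x_star."""
    estimate = b0 + b1 * x_star
    se_mean = sigma_hat * math.sqrt(1.0 / n + (x_star - x_bar) ** 2 / sxx)
    half_width = T_CRIT * se_mean
    return estimate, se_mean, estimate - half_width, estimate + half_width


est, se_mean, ci_low, ci_high = mean_response_ci(37.0)
print(f"estimate = {est:.4f}, s.e. = {se_mean:.4f}")
print(f"95% C.I. for mean = ({ci_low:.4f}, {ci_high:.4f})")
```

Up to rounding, the result should agree with the StatCrunch row for x = 37 (standard error about 0.675, interval about (14.74, 17.60)). Calling `mean_response_ci` over a grid of x values traces out the point-wise confidence band.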
In many cases, we are interested in obtaining an interval of plausible values of y for some future observation when the predictor variable has value x = x∗. Notice that now we are not estimating the mean of y given a value x∗. Rather, we are trying to predict a single value of y when x = x∗. Such a value is called a prediction of y for x = x∗. An interval of such plausible values of y is called a prediction interval.

Prediction of y for a given value of x = x∗

For a given x = x∗, we can predict a future single value by ŷ = β̂0 + x∗β̂1.

• The variance of the prediction error is
  $$\sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}} \right],$$
  where $S_{XX} = \sum_i (x_i - \bar{x})^2$. The extra 1, compared with the variance of the estimated mean, is the variance σ² of the new observation's own error e.
• A 100(1 − α)% prediction interval for a future y when x = x∗ is
  $$\hat{y} \pm t_{\alpha/2,\,n-2} \sqrt{ \hat{\sigma}^2 \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{XX}} \right] }.$$

We see that
• The estimated mean of y and the predicted value of y when x = x∗ are the same (both are β̂0 + x∗β̂1).
• The variability of the prediction is larger than the variability of the estimated mean. This results in the prediction interval being wider than the confidence interval.
• As with the confidence interval for the mean, we get better prediction accuracy (less variability) when x∗ is close to x̄.

We show the prediction results for corrosion data below.

1. Recall the Arsenic data example discussed in the previous lecture. We saw data on x = pH and y = arsenic removed (%) by a particular process. Data for this example are shown in the book (Example 12.2). Here is the StatCrunch output.

Summary statistics:
Column            n    Mean        Variance    Std. dev.   Std. err.
pH                18   8.4833333   1.0159647   1.0079507   0.23757627
Arsenic removed   18   37.277778   365.74183   19.124378   4.5076591

Simple linear regression results:
Dependent Variable: Arsenic removed
Independent Variable: pH
Arsenic removed = 190.26829 - 18.034245 pH
Sample size: 18
R (correlation coefficient) = -0.95049529
R-sq = 0.9034413
Estimate of error standard deviation: 6.1255839

Parameter estimates:
Parameter   Estimate     Std. Err.   DF   95% L. Limit   95% U. Limit
Intercept   190.26829    12.587118   16   163.58479      216.95179
Slope       -18.034245   1.4739533   16   -21.158887     -14.909604

(a) Estimate the mean arsenic removal percentage when pH = 8.5, and construct a 95% confidence interval. Hint: t0.025,16 = 2.12.

(b) Predict the arsenic removal percentage that would be observed in a future water sample with pH = 7.5, and construct a 95% prediction interval.

(c) Would you recommend predicting arsenic removal percentage for a pH of 6.5? Explain.

2. Carbonation of concrete results from a chemical reaction that lowers the pH value by enough to initiate corrosion of steel reinforcing bars. Representative data on x = carbonation depth (mm) and y = strength (MPa) for a sample of 18 core specimens were taken (“The Carbonation of Concrete Structures in the Tropical Environment of Singapore,” Magazine of Concrete Res., 1996: 293-300). Here is the StatCrunch output.

Summary statistics:
Column              n    Mean
Carbonation depth   18   36.611111
Strength            18   16.288889

Simple linear regression results:
Dependent Variable: Strength
Independent Variable: Carbonation depth
Strength = 27.182936 - 0.29756123 Carbonation depth
Sample size: 18
R (correlation coefficient) = -0.87497382
R-sq = 0.76557918
Estimate of error standard deviation: 2.864026

Parameter estimates:
Parameter   Estimate      Std. Err.
Intercept   27.182936     1.6513481
Slope       -0.29756123   0.041164172

(a) Estimate the mean strength for all core specimens having a carbonation depth of 45 and construct a 95% confidence interval.

(b) Predict the strength of a single specimen having a carbonation depth of 45, and construct a 95% prediction interval.
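
As a numerical check of the prediction interval formula from the section on prediction, the sketch below reproduces the StatCrunch 95% P.I. at x∗ = 37 for the corrosion data (it deliberately does not use x∗ = 45, so the exercise is left to the reader). It is a sketch under stated assumptions rather than StatCrunch's own procedure: SXX is recovered from the standard relation SE(β̂1) = σ̂/√SXX, the critical value t0.025,16 ≈ 2.119905 is hard-coded, and the function name `prediction_interval` is hypothetical.

```python
import math

# Values copied from the StatCrunch output in Exercise 2.
n = 18
b0, b1 = 27.182936, -0.29756123   # least squares estimates
sigma_hat = 2.864026              # estimate of error standard deviation
se_b1 = 0.041164172               # std. err. of slope
x_bar = 36.611111                 # mean carbonation depth

# Assumption: SE(b1) = sigma_hat / sqrt(SXX), so SXX is recovered
# without the raw data.
sxx = (sigma_hat / se_b1) ** 2

# Assumption: critical value t_{0.025, 16}, hard-coded instead of computed.
T_CRIT = 2.119905


def prediction_interval(x_star):
    """Return (prediction, lower, upper) for a single future y at x_star."""
    y_hat = b0 + b1 * x_star
    # The leading 1.0 is the extra variance of the new observation's own error.
    se_pred = sigma_hat * math.sqrt(
        1.0 + 1.0 / n + (x_star - x_bar) ** 2 / sxx
    )
    half_width = T_CRIT * se_pred
    return y_hat, y_hat - half_width, y_hat + half_width


y_hat, pi_low, pi_high = prediction_interval(37.0)
print(f"prediction = {y_hat:.4f}")
print(f"95% P.I. for new = ({pi_low:.4f}, {pi_high:.4f})")
```

Up to rounding, this should agree with the reported interval (about (9.94, 22.41)), which is much wider than the confidence interval for the mean at the same x∗.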