ST 370 Probability and Statistics for Engineers
Simple Linear Regression

Hand Calculations

Calculation of \(\hat\beta_0\), \(\hat\beta_1\), and SSE from the defining formulas is tedious; these shortcuts can help:

\[ S_{xx} = \sum_{i=1}^n (x_i - \bar{x})^2 = \sum_{i=1}^n x_i^2 - \frac{1}{n}\left(\sum_{i=1}^n x_i\right)^2 \]

\[ S_{xy} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^n x_i y_i - \frac{1}{n}\left(\sum_{i=1}^n x_i\right)\left(\sum_{i=1}^n y_i\right) \]

\[ SST = S_{yy} = \sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n y_i^2 - \frac{1}{n}\left(\sum_{i=1}^n y_i\right)^2 \]

\[ SSE = SST - \hat\beta_1 S_{xy} \]

Standard errors

Calculation of standard errors begins with an estimate of \(\sigma\), the standard deviation of the noise term \(\epsilon\). The least squares residuals are

\[ e_i = y_i - (\hat\beta_0 + \hat\beta_1 x_i), \]

so the residual sum of squares is

\[ SSE = \sum_{i=1}^n e_i^2. \]

Because two parameters were estimated in finding the residuals, the residual degrees of freedom are \(n - 2\), and the estimate of \(\sigma^2\) is

\[ \hat\sigma^2 = MSE = \frac{SSE}{n - 2}. \]

The estimated standard errors of the least squares estimates are

\[ se(\hat\beta_1) = \hat\sigma \sqrt{\frac{1}{S_{xx}}} \quad \text{and} \quad se(\hat\beta_0) = \hat\sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}. \]

These estimated standard errors, especially \(se(\hat\beta_1)\), are used to set up confidence intervals like

\[ \hat\beta_1 \pm t_{\alpha/2,\nu} \times \text{estimated standard error} \]

and test statistics like

\[ t_{obs} = \frac{\hat\beta_1}{\text{estimated standard error}}. \]

Predicting a New Observation

Regression equations are often used to predict what the response would be in a new experiment, in which the predictor value is, say, \(x_{new}\). If we carried out many experiments with \(x = x_{new}\), we would expect the average response to be \(\beta_0 + \beta_1 x_{new}\), which we estimate by

\[ \hat{Y}_{new} = \hat\beta_0 + \hat\beta_1 x_{new}. \]
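The shortcut formulas and standard-error expressions above are easy to check numerically against R's lm(). Here is a sketch using simulated data (the variables x and y are made up for illustration; they are not the oxygen data):

```r
set.seed(1)
n <- 20
x <- runif(n, 0, 2)
y <- 74 + 15 * x + rnorm(n, sd = 1)

# Shortcut formulas
Sxx <- sum(x^2) - sum(x)^2 / n
Sxy <- sum(x * y) - sum(x) * sum(y) / n
SST <- sum(y^2) - sum(y)^2 / n

beta1 <- Sxy / Sxx
beta0 <- mean(y) - beta1 * mean(x)
SSE   <- SST - beta1 * Sxy

# Estimate of sigma^2 on n - 2 degrees of freedom
MSE <- SSE / (n - 2)
se1 <- sqrt(MSE) * sqrt(1 / Sxx)
se0 <- sqrt(MSE) * sqrt(1 / n + mean(x)^2 / Sxx)

# Compare the hand calculations with lm()
fit <- lm(y ~ x)
cbind(hand = c(beta0, beta1), lm = coef(fit))
cbind(hand = c(se0, se1),
      lm = coef(summary(fit))[, "Std. Error"])
```

The two columns of each comparison agree up to rounding, which is a useful sanity check when doing the hand calculations on an exam.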
When this mean response is the quantity of interest, we use its estimated standard error

\[ se(\hat{Y}_{new}) = \hat\sigma \sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}} \]

to set up confidence intervals. Note that if \(x_{new} = 0\), \(\hat{Y}_{new}\) is just \(\hat\beta_0\), and this expression reduces to the estimated standard error given earlier.

Note also that \(se(\hat{Y}_{new})\) is always at least \(\hat\sigma / \sqrt{n}\), and increases as \(x_{new}\) gets farther from \(\bar{x}\).

Often, however, what is wanted is not a confidence interval for the mean response, but a prediction interval for a single new observation \(Y_{new}\) at \(x = x_{new}\). Now

\[ Y_{new} = \beta_0 + \beta_1 x_{new} + \epsilon_{new}, \]

so our uncertainty about \(Y_{new}\) comes from two sources:

* We have to use the estimates \(\hat\beta_0\) and \(\hat\beta_1\) instead of the true values;
* We have no information about the value of \(\epsilon_{new}\).

The best prediction of \(Y_{new}\) is still \(\hat{Y}_{new}\), but the estimated prediction standard error is

\[ pse(Y_{new}) = \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}}}. \]

We use the estimated prediction standard error to set up a \(100(1 - \alpha)\%\) prediction interval for \(Y_{new}\):

\[ \hat{Y}_{new} \pm t_{\alpha/2,\nu} \times pse(Y_{new}). \]

In R

The predict() method can produce either:

* a confidence interval, for the mean response;
* a prediction interval, for a single new observation.
    oxygenLm <- lm(Purity ~ HC, oxygen)

    # confidence interval for mean response:
    predict(oxygenLm, newdata = data.frame(HC = 1.0),
            interval = "confidence")

    # prediction interval for single observation:
    predict(oxygenLm, newdata = data.frame(HC = 1.0),
            interval = "prediction")

Indicator variables

Typically, the predictor variable x is a controllable or measured variable; however, sometimes it is an artificial quantity, constructed to convey some information.

An indicator variable is a variable that takes only the values 0 and 1; the observed data can then be divided into two groups: those where x = 0 and those where x = 1.

When x = 0, the regression model

\[ Y = \beta_0 + \beta_1 x + \epsilon \]

simplifies to

\[ Y = \beta_0 + \epsilon, \]

so the mean response for this group is \(\beta_0\). When x = 1, the regression model becomes

\[ Y = \beta_0 + \beta_1 + \epsilon, \]

so the mean response for this group is \(\beta_0 + \beta_1\).

So the interpretation of the coefficients \(\beta_0\) and \(\beta_1\) is:

* \(\beta_0\) is the mean response for group 0;
* \(\beta_1\) is the difference between the mean response for group 1 and the mean response for group 0.

When we discussed the factorial design with one factor having two levels, we used the model

\[ Y_{i,j} = \mu + \tau_i + \epsilon_{i,j}, \quad i = 1 \text{ or } 2, \quad j = 1, 2, \dots, n, \]

with the constraint \(\tau_1 = 0\); so:

* \(\mu\) is the mean response for the first (baseline) level of the factor;
* \(\tau_2\) is the difference between the mean response for the second level and the mean response for the first level.

The two models represent the same features, in a different notation.
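The interpretation of \(\beta_0\) and \(\beta_1\) with an indicator predictor can be checked directly in R. A sketch with made-up two-group data (the numbers are illustrative, not from any example in these slides):

```r
set.seed(2)
# Indicator variable: 0 for the first group, 1 for the second
x <- rep(c(0, 1), each = 10)
y <- 5 + 3 * x + rnorm(20, sd = 0.5)

fit <- lm(y ~ x)
b <- coef(fit)

# The intercept equals the group-0 sample mean
c(intercept = unname(b[1]), group0mean = mean(y[x == 0]))

# The slope equals the difference of the two group means
c(slope = unname(b[2]),
  meandiff = mean(y[x == 1]) - mean(y[x == 0]))
```

With a 0/1 predictor these identities hold exactly, not just approximately, which is why the regression model and the two-level factorial model are interchangeable.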
That is: the factorial model with a single factor having two levels is essentially the same as the regression model with an indicator variable as the predictor.

That insight is not especially helpful in this context, but when we move on to regression models with more than one predictor it is very useful. Using indicator variables allows us to put regression models and factorial models into a unified framework, called the general linear model.

Coefficient of Determination

How well does a regression model fit a given set of data? A careful answer must look at the context: does the model give usefully precise predictions? What is "usefully precise" varies from one context to another.

A quick but not careful answer is to look at the coefficient of determination, \(R^2\):

\[ R^2 = 1 - \frac{SSE}{SST}, \]

the fraction of variability in Y explained by the predictor x.

In R

For the Oxygen example, part of the regression output is:

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)   74.283      1.593   46.62  < 2e-16 ***
    HC            14.947      1.317   11.35 1.23e-09 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.087 on 18 degrees of freedom
    Multiple R-squared:  0.8774,    Adjusted R-squared:  0.8706
    F-statistic: 128.9 on 1 and 18 DF,  p-value: 1.227e-09

The \(R^2\) is reported as the Multiple R-squared, 0.8774. We might report that "87.74% of the variability in oxygen purity is explained by variations in hydrocarbon levels".

Regression with Transformed Variables

In some applications of regression, either the response Y or the predictor x may need to be transformed, and sometimes both.
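The identity \(R^2 = 1 - SSE/SST\) can be verified against the value summary() reports for any fitted model. A sketch with simulated data (the oxygen data set itself is not reproduced here, so the numbers are illustrative):

```r
set.seed(3)
x <- runif(20, 0.9, 1.5)
y <- 74 + 15 * x + rnorm(20)

fit <- lm(y ~ x)

# SSE from the residuals, SST from the centered response
SSE <- sum(residuals(fit)^2)
SST <- sum((y - mean(y))^2)

R2 <- 1 - SSE / SST
c(byhand = R2, reported = summary(fit)$r.squared)
```

The two values agree, confirming that the Multiple R-squared line in the output is exactly this ratio.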
Example: wind turbine

* Response: DC output
* Predictor: wind speed

In R

    turbine <- read.csv("Data/Table-11-05.csv")
    with(turbine, plot(WindSpeed, Output))
    with(turbine, plot(-1 / WindSpeed, Output))

No straight line will give a good fit to the plot of Output versus WindSpeed. The plot of Output versus -1 / WindSpeed looks more appropriate for a straight-line fit. Try both:

    summary(lm(Output ~ WindSpeed, turbine))
    summary(lm(Output ~ I(-1 / WindSpeed), turbine))

Note: many symbols like "-" and "/" that usually represent arithmetic operations have special meanings in a formula, so an expression like -1 / WindSpeed must be "wrapped" in the identity function I().

Note that the second model has \(R^2 = 0.98\), while the first has \(R^2 = 0.8745\). That is, using -1 / WindSpeed as the predictor accounts for much more of the variability in Output than using WindSpeed.

Interestingly, using -1 / WindSpeed also results in the model equation

\[ \hat{y} = 2.98 - \frac{6.93}{x}, \]

which predicts that the Output will never be higher than 2.98, no matter how high the wind speed.
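The asymptote is easy to see by plugging large wind speeds into the fitted equation. A sketch using only the coefficients quoted above (2.98 and 6.93), without the turbine data file:

```r
# Fitted equation from the transformed-predictor model
pred <- function(x) 2.98 - 6.93 / x

# Since -1/x increases with x, predicted output rises with
# wind speed but can never exceed the intercept, 2.98
sapply(c(5, 10, 50, 100), pred)
```

Each prediction is larger than the last, yet all stay below 2.98: a physically sensible saturation that the untransformed straight-line model cannot capture.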