Project #3 Answers
STAT 870 Fall 2012

Complete the following problems below. Within each part, include your R program output with code inside of it and any additional information needed to explain your answer. Note that you will need to edit your output and code in order to make it look nice after you copy and paste it into your Word document.

1) (42 total points) Complete the following for the model E(taste) = β0 + β1Acetic + β2H2S + β3Lactic:

a) (3 points) Find the sample regression model using the lm() function.

> library(RODBC)
> z<-odbcConnectExcel(xls.file = "C:\\chris\\unl\\Dropbox\\NEW\\STAT870\\projects\\Fall2012\\cheese.xls")
> cheese<-sqlFetch(channel = z, sqtable = "Sheet1")
> close(z)
> head(cheese)
  Case taste Acetic   H2S Lactic
1    1  12.3  4.543 3.135   0.86
2    2  20.9  5.159 5.043   1.53
3    3  39.0  5.366 5.438   1.57
4    4  47.9  5.759 7.496   1.81
5    5   5.6  4.663 3.807   0.99
6    6  25.9  5.697 7.601   1.09
> mod.fit<-lm(formula = taste ~ Acetic + H2S + Lactic, data = cheese)
> summary(mod.fit)

Call:
lm(formula = taste ~ Acetic + H2S + Lactic, data = cheese)

Residuals:
    Min      1Q  Median      3Q     Max
-17.390  -6.612  -1.009   4.908  25.449

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -28.8768    19.7354  -1.463  0.15540
Acetic        0.3277     4.4598   0.073  0.94198
H2S           3.9118     1.2484   3.133  0.00425 **
Lactic       19.6705     8.6291   2.280  0.03108 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.13 on 26 degrees of freedom
Multiple R-squared: 0.6518, Adjusted R-squared: 0.6116
F-statistic: 16.22 on 3 and 26 DF, p-value: 3.81e-06

The sample regression model is taste^ = -28.88 + 0.3277Acetic + 3.9118H2S + 19.67Lactic.

b) (8 points) Reproduce all of the numerical measures given in the printed output of summary.lm() through using matrix methods. Because there are a lot of calculations to be completed, you need to organize everything very well to receive full credit.
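As a quick sanity check of the matrix approach requested in part b), the identity b = (X'X)^-1 X'Y can be verified against lm() on a small simulated data set. This is an illustrative sketch only; the variables x1, x2, and y below are synthetic stand-ins, not the cheese data.

```r
# Sketch: verify that the normal-equations estimate matches lm().
# Synthetic data; x1, x2, y are illustrative, not the cheese data.
set.seed(870)
n <- 30
x1 <- rnorm(n, mean = 5.5, sd = 0.6)
x2 <- rnorm(n, mean = 6, sd = 2)
y  <- -28 + 0.3*x1 + 4*x2 + rnorm(n, sd = 10)

X <- cbind(1, x1, x2)                  # design matrix with an intercept column
b <- solve(t(X) %*% X) %*% t(X) %*% y  # b = (X'X)^{-1} X'Y

all.equal(as.vector(b), unname(coef(lm(y ~ x1 + x2))))  # TRUE
```

The cheese calculation that follows uses exactly this identity, with X containing an intercept column and the Acetic, H2S, and Lactic columns.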
> #Find b
> X<-as.matrix(cbind(1, cheese[,-c(1,2)]))
> Y<-cheese$taste
> data.frame(X, Y)
   X1 Acetic    H2S Lactic    Y
1   1  4.543  3.135   0.86 12.3
2   1  5.159  5.043   1.53 20.9
3   1  5.366  5.438   1.57 39.0
4   1  5.759  7.496   1.81 47.9
5   1  4.663  3.807   0.99  5.6
6   1  5.697  7.601   1.09 25.9
7   1  5.892  8.726   1.29 37.3
8   1  6.078  7.966   1.78 21.9
9   1  4.898  3.850   1.29 18.1
10  1  5.242  4.174   1.58 21.0
11  1  5.740  6.142   1.68 34.9
12  1  6.446  7.908   1.90 57.2
13  1  4.477  2.996   1.06  0.7
14  1  5.236  4.942   1.30 25.9
15  1  6.151  6.752   1.52 54.9
16  1  6.365  9.588   1.74 40.9
17  1  4.787  3.912   1.16 15.9
18  1  5.412  4.700   1.49  6.4
19  1  5.247  6.174   1.63 18.0
20  1  5.438  9.064   1.99 38.9
21  1  4.564  4.949   1.15 14.0
22  1  5.298  5.220   1.33 15.2
23  1  5.455  9.242   1.44 32.0
24  1  5.855 10.199   2.01 56.7
25  1  5.366  3.664   1.31 16.8
26  1  6.043  3.219   1.46 11.6
27  1  6.458  6.962   1.72 26.5
28  1  5.328  3.912   1.25  0.7
29  1  5.802  6.685   1.08 13.4
30  1  6.176  4.787   1.25  5.5
> b<-solve(t(X)%*%X) %*% t(X)%*%Y
> #b

> #Residual based measures
> Y.hat<-X%*%b
> e<-Y-Y.hat
> n<-length(Y)
> SSE<-t(e)%*%e
> #summary(e)
> p<-nrow(b)
> df<-n-p
> MSE<-SSE/df

> #Standard errors
> cov.beta.hat<-as.numeric(MSE)*solve(t(X)%*%X)
> #cov.beta.hat
> SE<-sqrt(diag(cov.beta.hat))
> #SE

> #t statistic
> t.stat<-b/SE
> #t.stat
> p.valuet<-2*(1-pt(q = abs(t.stat), df = df))

> #R^2
> J<-matrix(data = 1, nrow = n, ncol = n)
> SSTO<-t(Y)%*%Y-1/n*t(Y)%*%J%*%Y
> SSR<-SSTO - SSE
> R.sq<-SSR/SSTO
> #R.sq
> adj.R.sq<-1 - (n-1)/df * (1 - R.sq)
> #adj.R.sq

> #F
> MSR<-SSR/(p-1)
> F.stat<-MSR/MSE
> #F.stat
> p.valueF<-1 - pf(q = F.stat, df1 = p-1, df2 = df)

> #Similar output to summary.lm()
> cat("Residuals: \n")
Residuals: 
> t(summary(e))
 Min.   :-17.391   1st Qu.: -6.612   Median : -1.009   Mean   :  0.000   3rd Qu.:  4.908   Max.   : 25.449
> cat("Coefficients: \n")
Coefficients: 
> data.frame(names = c("(Intercept)", names(cheese)[3:5]), Estimate = round(b,4), Std.Error = round(SE,4), "t.value" = round(t.stat,3), p.value = round(p.valuet,5), row.names = NULL)
        names Estimate Std.Error t.value p.value
1 (Intercept) -28.8768   19.7354  -1.463 0.15540
2      Acetic   0.3277    4.4598   0.073 0.94198
3         H2S   3.9118    1.2484   3.133 0.00425
4      Lactic  19.6705    8.6291   2.280 0.03108
> cat("Residual standard error:", round(sqrt(MSE),2), "on", df, "degrees of freedom \n")
Residual standard error: 10.13 on 26 degrees of freedom 
> cat("Multiple R-squared: ", round(R.sq, 4), ", Adjusted R-squared: ", round(adj.R.sq, 4), "\n", sep = "")
Multiple R-squared: 0.6518, Adjusted R-squared: 0.6116
> cat("F-statistic:", round(F.stat, 2), "on", p-1, "and", df, "DF, p-value:", signif(p.valueF, 3), "\n")
F-statistic: 16.22 on 3 and 26 DF, p-value: 3.81e-06 

c) (3 points) Perform an overall F-test to determine if the model is useful for estimating taste. Use α = 0.05.

i) H0: β1 = β2 = β3 = 0 vs. Ha: At least one βj ≠ 0
ii) F = 16.22, p-value = 3.81×10^-6
iii) α = 0.05
iv) Because 3.81×10^-6 < 0.05, reject H0
v) There is sufficient evidence that at least one of acetic acid, hydrogen sulfide, and lactic acid (in a linear manner) is important in predicting taste.

d) (3 points) How does each of the predictor variables affect taste? Give the correct interpretation for each of the predictor variables using the sample regression model. Make sure to fully explain your answers.

i) For every one unit increase in acetic acid, the taste is estimated to increase by 0.3277, holding hydrogen sulfide and lactic acid constant.
ii) For every one unit increase in hydrogen sulfide, the taste is estimated to increase by 3.91, holding acetic acid and lactic acid constant.
iii) For every one unit increase in lactic acid, the taste is estimated to increase by 19.67, holding acetic acid and hydrogen sulfide constant.

e) (6 points) Use a t-test for acetic acid (H0: β1 = 0 vs. Ha: β1 ≠ 0) with the help of the output from the summary() function.
Perform this same test using a partial F-test and the help of the anova() function. Perform the same test one more time with a partial F-test and the Anova() function. Use α = 0.05.

i) H0: β1 = 0 vs. Ha: β1 ≠ 0
ii) t = 0.073 (from previous output in a)), F = 0.073² = 0.0054, p-value = 0.9420
iii) α = 0.05
iv) Because 0.9420 > 0.05, do not reject H0
v) There is not sufficient evidence to indicate that acetic acid is important in predicting taste given that hydrogen sulfide and lactic acid are in the model.

> mod.fit2<-lm(formula = taste ~ H2S + Lactic, data = cheese)
> #summary(mod.fit2)  #Did not include output here to save space
> anova(mod.fit, mod.fit2)
Analysis of Variance Table

Model 1: taste ~ Acetic + H2S + Lactic
Model 2: taste ~ H2S + Lactic
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     26 2668.4
2     27 2669.0 -1  -0.55427 0.0054  0.942
> library(package = car)
> Anova(mod.fit)
Anova Table (Type II tests)

Response: taste
           Sum Sq Df F value   Pr(>F)
Acetic       0.55  1  0.0054 0.941980
H2S       1007.66  1  9.8182 0.004247 **
Lactic     533.32  1  5.1964 0.031079 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residuals 2668.41 26

f) (3 points) The tests for acetic acid in the previous part resulted in large p-values. However, the same type of test performed for the model E(taste) = β0 + β1Acetic had a low p-value. Why does this occur?

This occurs because the other variables (hydrogen sulfide and lactic acid) must be accounting for the same type of information in the taste response as acetic acid does.

g) (5 points) For the mean values of acetic acid, hydrogen sulfide, and lactic acid, estimate the value of taste and find a 95% confidence interval for the mean response. Use the predict() function to aid in your calculations. Compare your answer here to what would be obtained with using the model E(taste) = β0 + β1Acetic as in project #1. Discuss why the differences in the confidence intervals occur.
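As an aside to part g), the interval that predict(..., interval = "confidence") returns can be reproduced from the matrix formulas used earlier. The sketch below uses synthetic data with illustrative names x1 and x2, not the cheese predictors.

```r
# Sketch: reproduce predict()'s confidence interval for the mean response,
#   x0'b +/- t(0.975, n-p) * sqrt(MSE * x0' (X'X)^{-1} x0)
# Synthetic data; not the cheese data from the problem.
set.seed(1)
n <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2*x1 - x2 + rnorm(n)
mod <- lm(y ~ x1 + x2)

X   <- model.matrix(mod)
b   <- coef(mod)
MSE <- summary(mod)$sigma^2
x0  <- c(1, mean(x1), mean(x2))   # predictor values of interest

fit <- sum(x0 * b)
se  <- drop(sqrt(MSE * t(x0) %*% solve(t(X) %*% X) %*% x0))
ci  <- fit + c(-1, 1) * qt(0.975, df = n - 3) * se

# Gives the same fit, lwr, and upr values as:
predict(mod, newdata = data.frame(x1 = mean(x1), x2 = mean(x2)),
        interval = "confidence")
```

The se term above is why the interval narrows when MSE shrinks, which is exactly the comparison discussed in the answer to part g).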
> predict(object = mod.fit, newdata = data.frame(Acetic = mean(cheese$Acetic), H2S = mean(cheese$H2S), Lactic = mean(cheese$Lactic)), level = 0.95, interval = "confidence")
       fit      lwr      upr
1 24.53333 20.73142 28.33525

taste^ = 24.53 and 20.73 < E(taste) < 28.34

When I use the model only with acetic acid as the predictor variable, I obtain

> predict(object = mod.fit1, newdata = data.frame(Acetic = mean(cheese$Acetic)), level = 0.95, interval = "confidence")
       fit      lwr      upr
1 24.53333 19.36438 29.70229

taste^ = 24.53 and 19.36 < E(taste) < 29.70

Interestingly, the estimates of taste are the same (this will not usually occur in other applications). The confidence interval for the model with 3 predictor variables is smaller in width than the confidence interval with only the acetic acid predictor variable. This occurs because hydrogen sulfide and lactic acid are important predictor variables (see t-test results). Thus, more variability in the response is explained by the model with 3 predictor variables rather than only 1. This causes the SSE and then the MSE to be smaller, which then causes the confidence interval to be smaller in width for the three predictor variable model.

h) (5 points) Using examine.mod.multiple() and the original model for this problem, comment on 1) linearity of the regression model, 2) constant error variance, 3) outliers, and 4) normality of ε. You only need to include the plots that allow you to assess these 4 items here in order to save space.

> save.it<-examine.mod.multiple(mod.fit.obj = mod.fit, const.var.test = TRUE, boxcox.find = TRUE)

[Plots produced by examine.mod.multiple(), not reproduced in this transcription: residuals and semistudentized residuals vs. the estimated mean response, residuals vs. each of the three predictor variables, residuals vs. observation number, histogram of the semistudentized residuals, normal Q-Q plot, and Box-Cox transformation plot with 95% confidence interval.]

> save.it$levene
     [,1]   [,2]       [,3]
[1,]    1 0.8407 0.36702910
[2,]    2 3.5560 0.06974367
[3,]    3 3.1419 0.08718428
> save.it$bp

	Breusch-Pagan test

data:  mod.fit.obj
BP = 4.494, df = 3, p-value = 0.2128

> save.it$lambda.hat
[1] 0.67

i) Linearity of the regression model

Plots of the residuals versus each of the predictor variables contain a random scattering of points. Therefore, no transformations are suggested by the plots.

ii) Constant error variance

The plot of ei vs. Ŷi contains a random scattering of points. There is not any indication of non-constant error variance. The BP test results in a p-value of 0.21, indicating there is not sufficient evidence of non-constant error variance. Levene's test results in a large p-value of 0.37 for acetic acid and two moderately small p-values for hydrogen sulfide (p-value = 0.07) and lactic acid (0.09). Thus, there may be problems with the constant error variance assumption, but the evidence is not strong. The 95% confidence interval for λ from the Box-Cox transformation appears to have an upper bound of approximately 1. Also, λ̂ = 0.67. Thus, there again is some evidence of non-constant error variance, but it is not strong.

iii) Outliers

The plot of the semistudentized residuals vs. Ŷi does not have any points outside of -3 and 3. There are not any outliers as indicated by the plots.

iv) Normality of εi

The Q-Q plot of the residuals has most of the points lying on the straight line, with a few deviations on the right side. Overall, this plot does not provide sufficient evidence against normality to warrant a change in the model.
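As a supplementary check that the problem does not require, normality of the residuals can also be tested formally with shapiro.test(). The sketch below uses simulated data with normal errors as a stand-in, since the cheese residuals are not reproduced here.

```r
# Sketch: Shapiro-Wilk normality test applied to residuals from a fitted model.
# Simulated data with normal errors; an illustrative stand-in for the cheese fit.
set.seed(30)
x <- rnorm(30)
y <- 1 + 2*x + rnorm(30)
mod <- lm(y ~ x)
shapiro.test(residuals(mod))  # a large p-value gives no evidence against normality
```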
The histogram has a somewhat mound shape like a normal distribution, but the left tail does not have any observations. However, with only a sample size of 30, this is not necessarily surprising. Overall, this plot also does not provide sufficient evidence against normality to warrant a change in the model.

i) Because the t-test for acetic acid had a large p-value, the "backward elimination" model building procedure would remove this variable from the overall model. Using the model E(taste) = β0 + β1H2S + β2Lactic, complete the following.

i) (2 points) Find the sample regression model.

> mod.fit2<-lm(formula = taste ~ H2S + Lactic, data = cheese)
> summary(mod.fit2)

Call:
lm(formula = taste ~ H2S + Lactic, data = cheese)

Residuals:
    Min      1Q  Median      3Q     Max
-17.343  -6.530  -1.164   4.844  25.618

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -27.592      8.982  -3.072  0.00481 **
H2S            3.946      1.136   3.475  0.00174 **
Lactic        19.887      7.959   2.499  0.01885 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.942 on 27 degrees of freedom
Multiple R-squared: 0.6517, Adjusted R-squared: 0.6259
F-statistic: 25.26 on 2 and 27 DF, p-value: 6.551e-07

The sample regression model is taste^ = -27.59 + 3.946H2S + 19.887Lactic.

ii) (4 points) Plot the sample regression model using the scatter3d() function. Include a few different angles of the plot in your answer. Interpret the effects of hydrogen sulfide and lactic acid using the plot.

> library(Rcmdr)
> scatter3d(formula = taste ~ H2S + Lactic, data = cheese, fit = "linear", bg = "white", grid = TRUE)

As hydrogen sulfide and lactic acid increase, the estimated value for taste increases.
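Part i) carries out one step of backward elimination by hand using the t-test p-value. For reference, an automated backward elimination (AIC-based rather than p-value-based) can be run with step(). The sketch below uses synthetic data in which a variable named Acetic is generated with no true effect; the variable names mimic the cheese data but the values are simulated.

```r
# Sketch: AIC-based backward elimination with step().
# Synthetic data; Acetic is generated with no effect on taste.
set.seed(870)
n <- 30
Acetic <- rnorm(n, mean = 5.5, sd = 0.6)
H2S    <- rnorm(n, mean = 6, sd = 2)
Lactic <- rnorm(n, mean = 1.4, sd = 0.3)
taste  <- -28 + 4*H2S + 20*Lactic + rnorm(n, sd = 10)

full <- lm(taste ~ Acetic + H2S + Lactic)
step(full, direction = "backward", trace = 0)  # typically drops Acetic here
```

Note that AIC-based selection and p-value-based elimination do not always agree; for these data both approaches point toward removing the variable with no real relationship to the response.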