Project #3 Answers
STAT 870
Fall 2012
Complete the problems below. Within each part, include your R code with its output and any additional information needed to explain your answer. Note that you will need to edit your output and code to make it look nice after you copy and paste it into your Word document.
1) (42 total points) Complete the following for the model E(taste) = β0 + β1Acetic + β2H2S + β3Lactic:
a) (3 points) Find the sample regression model using the lm() function.
> library(RODBC)
> z<-odbcConnectExcel(xls.file = "C:\\chris\\unl\\Dropbox\\NEW\\STAT870\\projects
    \\Fall2012\\cheese.xls")
> cheese<-sqlFetch(channel = z, sqtable = "Sheet1")
> close(z)
> head(cheese)
  Case taste Acetic   H2S Lactic
1    1  12.3  4.543 3.135   0.86
2    2  20.9  5.159 5.043   1.53
3    3  39.0  5.366 5.438   1.57
4    4  47.9  5.759 7.496   1.81
5    5   5.6  4.663 3.807   0.99
6    6  25.9  5.697 7.601   1.09
> mod.fit<-lm(formula = taste ~ Acetic + H2S + Lactic, data = cheese)
> summary(mod.fit)

Call:
lm(formula = taste ~ Acetic + H2S + Lactic, data = cheese)

Residuals:
    Min      1Q  Median      3Q     Max
-17.390  -6.612  -1.009   4.908  25.449

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -28.8768    19.7354  -1.463  0.15540
Acetic        0.3277     4.4598   0.073  0.94198
H2S           3.9118     1.2484   3.133  0.00425 **
Lactic       19.6705     8.6291   2.280  0.03108 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.13 on 26 degrees of freedom
Multiple R-squared: 0.6518,     Adjusted R-squared: 0.6116
F-statistic: 16.22 on 3 and 26 DF,  p-value: 3.81e-06

taste-hat = -28.88 + 0.3277Acetic + 3.9118H2S + 19.67Lactic
b) (8 points) Reproduce all of the numerical measures given in the printed output of summary.lm() using matrix methods. Because there are many calculations to complete, you need to organize everything well to receive full credit.
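The matrix computations below implement the standard least squares results; as a reference sketch (notation matches the code, with n observations and p regression parameters):

```latex
\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}, \qquad
\hat{\mathbf{Y}} = \mathbf{X}\mathbf{b}, \qquad
\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}}, \qquad
SSE = \mathbf{e}'\mathbf{e}, \qquad
MSE = \frac{SSE}{n-p},
```
```latex
\widehat{\mathrm{Cov}}(\mathbf{b}) = MSE \cdot (\mathbf{X}'\mathbf{X})^{-1}, \qquad
t_k = \frac{b_k}{s\{b_k\}}, \qquad
R^2 = \frac{SSR}{SSTO}, \qquad
F = \frac{SSR/(p-1)}{SSE/(n-p)}
```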
> #Find b
> X<-as.matrix(cbind(1, cheese[,-c(1,2)]))
> Y<-cheese$taste
> data.frame(X, Y)
   X1 Acetic    H2S Lactic    Y
1   1  4.543  3.135   0.86 12.3
2   1  5.159  5.043   1.53 20.9
3   1  5.366  5.438   1.57 39.0
4   1  5.759  7.496   1.81 47.9
5   1  4.663  3.807   0.99  5.6
6   1  5.697  7.601   1.09 25.9
7   1  5.892  8.726   1.29 37.3
8   1  6.078  7.966   1.78 21.9
9   1  4.898  3.850   1.29 18.1
10  1  5.242  4.174   1.58 21.0
11  1  5.740  6.142   1.68 34.9
12  1  6.446  7.908   1.90 57.2
13  1  4.477  2.996   1.06  0.7
14  1  5.236  4.942   1.30 25.9
15  1  6.151  6.752   1.52 54.9
16  1  6.365  9.588   1.74 40.9
17  1  4.787  3.912   1.16 15.9
18  1  5.412  4.700   1.49  6.4
19  1  5.247  6.174   1.63 18.0
20  1  5.438  9.064   1.99 38.9
21  1  4.564  4.949   1.15 14.0
22  1  5.298  5.220   1.33 15.2
23  1  5.455  9.242   1.44 32.0
24  1  5.855 10.199   2.01 56.7
25  1  5.366  3.664   1.31 16.8
26  1  6.043  3.219   1.46 11.6
27  1  6.458  6.962   1.72 26.5
28  1  5.328  3.912   1.25  0.7
29  1  5.802  6.685   1.08 13.4
30  1  6.176  4.787   1.25  5.5
> b<-solve(t(X)%*%X) %*% t(X)%*%Y
> #b

> #Residual based measures
> Y.hat<-X%*%b
> e<-Y-Y.hat
> n<-length(Y)
> SSE<-t(e)%*%e
> #summary(e)
> p<-nrow(b)
> df<-n-p
> MSE<-SSE/df

> #Standard errors
> cov.beta.hat<-as.numeric(MSE)*solve(t(X)%*%X)
> #cov.beta.hat
> SE<-sqrt(diag(cov.beta.hat))
> #SE

> #t statistic
> t.stat<-b/SE
> #t.stat
> p.valuet<-2*(1-pt(q = abs(t.stat), df = df))
> #R^2
> J<-matrix(data = 1, nrow = n, ncol = n)
> SSTO<-t(Y)%*%Y-1/n*t(Y)%*%J%*%Y
> SSR<-SSTO - SSE
> R.sq<-SSR/SSTO
> #R.sq
> adj.R.sq<-1 - (n-1)/df * (1 - R.sq)
> #adj.R.sq

> #F
> MSR<-SSR/(p-1)
> F.stat<-MSR/MSE
> #F.stat
> p.valueF<-1 - pf(q = F.stat, df1 = p-1, df2 = df)
> #Similar output to summary.lm()
> cat("Residuals: \n")
Residuals:
> t(summary(e))
   V1   Min.   :-17.391   1st Qu.: -6.612   Median : -1.009   Mean   :  0.000   3rd Qu.:  4.908   Max.   : 25.449
> cat("Coefficients: \n")
Coefficients:
> data.frame(names = c("(Intercept)", names(cheese)[3:5]), Estimate = round(b,4),
    Std.Error = round(SE,4), "t.value" = round(t.stat,3), p.value =
    round(p.valuet,5), row.names = NULL)
        names Estimate Std.Error t.value p.value
1 (Intercept) -28.8768   19.7354  -1.463 0.15540
2      Acetic   0.3277    4.4598   0.073 0.94198
3         H2S   3.9118    1.2484   3.133 0.00425
4      Lactic  19.6705    8.6291   2.280 0.03108
> cat("Residual standard error:", round(sqrt(MSE),2), "on", df, "degrees of freedom \n")
Residual standard error: 10.13 on 26 degrees of freedom
> cat("Multiple R-squared: ", round(R.sq, 4), ", Adjusted R-squared: ",
    round(adj.R.sq, 4), "\n", sep = "")
Multiple R-squared: 0.6518, Adjusted R-squared: 0.6116
> cat("F-statistic:", round(F.stat, 2), "on", p-1, "and", df, "DF, p-value:",
    signif(p.valueF, 3), "\n")
F-statistic: 16.22 on 3 and 26 DF, p-value: 3.81e-06
c) (3 points) Perform an overall F-test to determine if the model is useful for estimating taste. Use α = 0.05.
i) H0: β1 = β2 = β3 = 0 vs. Ha: At least one βk ≠ 0
ii) F = 16.22, p-value = 3.81×10^-6
iii) α = 0.05
iv) Because 3.81×10^-6 < 0.05, reject H0
v) There is sufficient evidence that at least one of acetic acid, hydrogen sulfide, and lactic acid (in a linear manner) is important in predicting taste.
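For reference, the F statistic above comes from the ANOVA decomposition, with values as reported in the output of part b) (p = 4 parameters, n = 30 observations):

```latex
F = \frac{MSR}{MSE} = \frac{SSR/(p-1)}{SSE/(n-p)} = \frac{SSR/3}{SSE/26} = 16.22,
\qquad \text{reject } H_0 \text{ when } F > F(0.95;\, 3,\, 26).
```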
d) (3 points) How does each of the predictor variables affect taste? Give the correct interpretation for each of the predictor variables using the sample regression model. Make sure to fully explain your answers.
i) For every one unit increase in acetic acid, the taste is estimated to increase by 0.3277, holding hydrogen sulfide and lactic acid constant.
ii) For every one unit increase in hydrogen sulfide, the taste is estimated to increase by 3.91, holding acetic acid and lactic acid constant.
iii) For every one unit increase in lactic acid, the taste is estimated to increase by 19.67, holding acetic acid and hydrogen sulfide constant.
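Each interpretation is a partial effect: differencing the sample regression model in one predictor while the others are held fixed leaves only that predictor's coefficient. For hydrogen sulfide, for example:

```latex
\widehat{\text{taste}}(\text{Acetic},\, \text{H2S}+1,\, \text{Lactic}) -
\widehat{\text{taste}}(\text{Acetic},\, \text{H2S},\, \text{Lactic}) = b_2 = 3.9118
```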
e) (6 points) Use a t-test for acetic acid (H0: β1 = 0 vs. Ha: β1 ≠ 0) with the help of the output from the summary() function. Perform this same test using a partial F-test and the help of the anova() function. Perform the same test one more time with a partial F-test and the Anova() function. Use α = 0.05.
i) H0: β1 = 0 vs. Ha: β1 ≠ 0
ii) t = 0.073 (from previous output in a)), F = 0.073^2 = 0.0054, p-value = 0.9420
iii) α = 0.05
iv) Because 0.9420 > 0.05, do not reject H0
v) There is not sufficient evidence to indicate that acetic acid is important in predicting taste given that hydrogen sulfide and lactic acid are in the model.
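The t-test and partial F-test agree here because, for a test of a single regression coefficient, the F statistic is the square of the t statistic:

```latex
F = t^2 = (0.073)^2 \approx 0.0054,
```

so both tests necessarily produce the same p-value of 0.942.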
> mod.fit2<-lm(formula = taste ~ H2S + Lactic, data = cheese)
> #summary(mod.fit2) #Did not include output here to save space
> anova(mod.fit, mod.fit2)
Analysis of Variance Table

Model 1: taste ~ Acetic + H2S + Lactic
Model 2: taste ~ H2S + Lactic
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     26 2668.4
2     27 2669.0 -1  -0.55427 0.0054  0.942
> library(package = car)
> Anova(mod.fit)
Anova Table (Type II tests)

Response: taste
           Sum Sq Df F value   Pr(>F)
Acetic       0.55  1  0.0054 0.941980
H2S       1007.66  1  9.8182 0.004247 **
Lactic     533.32  1  5.1964 0.031079 *
Residuals 2668.41 26
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
f) (3 points) The tests for acetic acid in the previous part resulted in large p-values. However, the same type of test performed for the model E(taste) = β0 + β1Acetic had a low p-value. Why does this occur?
This occurs because the other variables (hydrogen sulfide and lactic acid) must be accounting
for the same type of information in the taste response as acetic acid does.
g) (5 points) For the mean values of acetic acid, hydrogen sulfide, and lactic acid, estimate the value of taste and find a 95% confidence interval for the mean response. Use the predict() function to aid in your calculations. Compare your answer here to what would be obtained with using the model E(taste) = β0 + β1Acetic as in project #1. Discuss why the differences in the confidence intervals occur.
> predict(object = mod.fit, newdata = data.frame(Acetic = mean(cheese$Acetic),
    H2S = mean(cheese$H2S), Lactic = mean(cheese$Lactic)),
    level = 0.95, interval = "confidence")
       fit      lwr      upr
1 24.53333 20.73142 28.33525

taste-hat = 24.53 and 20.73 < E(taste) < 28.34

When I use the model only with acetic acid as the predictor variable, I obtain

> predict(object = mod.fit1, newdata = data.frame(Acetic = mean(cheese$Acetic)),
    level = 0.95, interval = "confidence")
       fit      lwr      upr
1 24.53333 19.36438 29.70229

taste-hat = 24.53 and 19.36 < E(taste) < 29.70
Interestingly, the estimates of taste are the same (this will not usually occur in other applications). The confidence interval for the model with 3 predictor variables is smaller in width than the confidence interval with only the acetic acid predictor variable. This occurs because hydrogen sulfide and lactic acid are important predictor variables (see t-test results). Thus, more variability in the response is explained by the model with 3 predictor variables rather than only 1. This causes the SSE and then MSE to be smaller, which then causes the confidence interval to be smaller in width for the three predictor variable model.
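The narrowing can be seen in the standard form of the confidence interval for the mean response (notation as in the matrix work of part b)): a smaller MSE directly shrinks the estimated standard error and hence the interval width.

```latex
\hat{Y}_h \pm t(1-\alpha/2;\, n-p)\, s\{\hat{Y}_h\}, \qquad
s^2\{\hat{Y}_h\} = MSE \cdot \mathbf{X}_h'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_h
```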
h) (5 points) Using examine.mod.multiple() and the original model for this problem, comment on 1) linearity of the regression model, 2) constant error variance, 3) outliers, and 4) normality of ε. You only need to include the plots that allow you to assess these 4 items here in order to save space.
> save.it<-examine.mod.multiple(mod.fit.obj = mod.fit, const.var.test = TRUE,
    boxcox.find = TRUE)
[Plot matrix from examine.mod.multiple(): residuals vs. estimated mean response, semistudentized residuals vs. estimated mean response, residuals vs. predictor variables 1-3, residuals vs. observation number, histogram of semistudentized residuals, normal Q-Q plot, and Box-Cox transformation plot with 95% interval.]
> save.it$levene
     [,1]   [,2]       [,3]
[1,]    1 0.8407 0.36702910
[2,]    2 3.5560 0.06974367
[3,]    3 3.1419 0.08718428
> save.it$bp

        Breusch-Pagan test

data:  mod.fit.obj
BP = 4.494, df = 3, p-value = 0.2128

> save.it$lambda.hat
[1] 0.67
i) Linearity of the regression model
Plots of the residuals versus each of the predictor variables contain a random scattering of
points. Therefore, no transformations are suggested by the plots.
ii) Constant error variance
The plot of ei vs. Ŷi contains a random scattering of points. There is not any indication of
non-constant error variance. The BP test results in a p-value of 0.21 indicating there is not
sufficient evidence to indicate non-constant error variance. The Levene’s test results in a
large p-value of 0.37 for acetic acid and two moderately small p-values for hydrogen sulfide
(p-value = 0.07) and lactic acid (0.09). Thus, there may be problems with the non-constant
error variance assumption, but the evidence is not strong. The 95% confidence interval for λ from the Box-Cox transformation appears to have an upper bound of approximately 1. Also, λ-hat = 0.67. Thus, there again is some evidence of non-constant error variance, but it is not strong.
iii) Outliers
The plot of ei vs. Ŷi does not have any points outside of ei = –3 and 3. There are not any
outliers as indicated by the plots.
iv) Normality of εi
The Q-Q plot of ei has most of the points lying on the straight line, with a few deviations on the right side. Overall, this plot does not provide sufficient evidence against normality to warrant a change in the model. The histogram has a somewhat mound shape like a normal distribution, but the left tail does not have any observations. However, with a sample size of only 30, this is not necessarily surprising. Overall, this plot also does not provide sufficient evidence against normality to warrant a change in the model.
i) Because the t-test for acetic acid had a large p-value, the "backward elimination" model building procedure would remove this variable from the overall model. Using the model E(taste) = β0 + β1H2S + β2Lactic, complete the following.
i) (2 points) Find the sample regression model.
> mod.fit2<-lm(formula = taste ~ H2S + Lactic, data = cheese)
> summary(mod.fit2)

Call:
lm(formula = taste ~ H2S + Lactic, data = cheese)

Residuals:
    Min      1Q  Median      3Q     Max
-17.343  -6.530  -1.164   4.844  25.618

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -27.592      8.982  -3.072  0.00481 **
H2S            3.946      1.136   3.475  0.00174 **
Lactic        19.887      7.959   2.499  0.01885 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.942 on 27 degrees of freedom
Multiple R-squared: 0.6517,     Adjusted R-squared: 0.6259
F-statistic: 25.26 on 2 and 27 DF,  p-value: 6.551e-07

taste-hat = -27.59 + 3.946H2S + 19.887Lactic
ii) (4 points) Plot the sample regression model using the scatter3d() function. Include a few different angles of the plot in your answer. Interpret the effects of hydrogen sulfide and lactic acid using the plot.
> library(Rcmdr)
> scatter3d(formula = taste ~ H2S + Lactic, data = cheese, fit="linear",
    bg="white", grid=TRUE)

[3D scatterplot of taste vs. H2S and Lactic with the fitted regression plane; plots not reproduced here.]
As hydrogen sulfide and lactic acid increase, the estimated value for taste increases.