STAC67H3: Regression Analysis
Assignment # 02 Solution
November 13, 2015
Problem 1: Exercise 3.10 Per capita earnings – Page 149 [20 Marks] A sociologist employed linear regression model (2.1) to relate per capita earnings (Y) to average number of years of schooling (X) for 12 cities. The fitted values Ŷi and the semistudentized residuals e∗i follow.
i:      1      2      3    ···    10     11     12
Ŷi:    9.9    9.3   10.2   ···   15.6   11.2   13.1
e∗i:  -1.12   0.81  -0.76  ···  -3.78   0.74   0.32
a. [10 Marks] Plot the semistudentized residuals against the fitted values. What does the plot suggest?
Figure 1: Plot of the semistudentized residuals against the fitted values.
Most of the semistudentized residuals lie around the zero line, suggesting that a linear fit might be appropriate. One of the residuals lies outside the interval (−3, 3), suggesting the presence of a potential outlying observation.
## R Codes ##
## p1: data matrix of fitted values and semistudentized residuals (data not shown)
yhat <- p1[, 1]      ## fitted values
studres <- p1[, 2]   ## semistudentized residuals
plot(yhat, studres, xlab = "Fitted Values", ylab = "Semistudentized Residuals")
abline(h = 0, lty = 2)
b. [10 Marks] How many semistudentized residuals are outside ±1 standard deviation? Approximately
how many would you expect to see if the normal error model is appropriate?
Four of the 12 residuals (33%) are outside ±1 standard deviation. Under the normal error model we expect about 68% of the residuals to fall within one standard deviation of the mean; here the mean of the residuals is zero. In this example 67% (8 of 12) of the residuals are within one standard deviation of the mean. Hence, the normal error model appears reasonable.
## R Codes ##
sum(studres < -1 | studres > 1)        ## number of residuals outside +/- 1
mean(studres < -1 | studres > 1)       ## proportion outside +/- 1
1 - mean(studres < -1 | studres > 1)   ## proportion within +/- 1
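The expected count under normality can be sketched directly from the standard normal tail probability:
## Expected number of the 12 residuals outside +/- 1 SD under normality
12 * 2 * pnorm(-1)   ## approximately 3.8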
Problem 2: Exercise 3.11 Drug concentration – Page 149 [25 Marks] A pharmacologist employed linear regression model (2.1) to study the relation between the concentration of a drug in plasma (Y) and the log-dose of the drug (X). The residuals and log-dose levels follow.
i:    1    2    3    4    5    6    7    8    9
Xi:  -1    0    1   -1    0    1   -1    0    1
ei:  0.5  2.1 -3.4  0.3 -1.7  4.2 -0.6  2.6 -4.0
a. [10 Marks] Plot the residuals ei against Xi . What conclusions do you draw from the plot?
Figure 2: Plot of the residuals against the predictor variable.
The residuals are centered around the zero line, suggesting a linear fit might be appropriate. However, the variability of the residuals is larger for larger values of the predictor variable, suggesting an apparent violation of the constant variance assumption.
## R Codes ##
x <- c(-1, 0, 1, -1, 0, 1, -1, 0, 1)                      ## predictor variable
e <- c(0.5, 2.1, -3.4, 0.3, -1.7, 4.2, -0.6, 2.6, -4.0)   ## residuals
plot(x, e, xlab = "Predictor Variable", ylab = "Residuals")
abline(h = 0, lty = 2)
b. [15 Marks] Assume that (3.10) is appropriate and conduct the Breusch-Pagan test to determine whether
or not the error variance varies with log-dose of the drug (X). Use α = 0.05. State the alternatives,
decision rule, and conclusion. Does your conclusion support your preliminary findings in part (a)?
The error sum of squares (SSE) of the model

Yi = β0 + β1 Xi + εi

is SSE = 59.96. Consider the following model

log σ²i = γ0 + γ1 Xi

and the hypotheses H0: γ1 = 0 and HA: γ1 ≠ 0. The alternative hypothesis states that the constant variance assumption does not hold for the first model. The second model provides the regression sum of squares SSR∗ = 330.04. Hence, the BP test statistic is

χ²BP = (SSR∗/2) / (SSE/n)² = 3.717906.
Decision Rule: At α = 0.05 level of significance, the null hypothesis is rejected if
χ²BP ≥ χ²(0.95; 1) = 3.84.
Conclusion: The null hypothesis cannot be rejected at the 0.05 level of significance. Parts (a) and (b) give differing results, perhaps because of the small sample size.
## R Codes ##
n <- length(x)
SSE <- sum(e^2)
reg <- lm(e^2 ~ x)
anova(reg)
SSR <- 330.04   ## SSR* read from the anova(reg) output above
chiBP <- (SSR/2)/(SSE/n)^2
chiBP
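The P-value of the test can also be read off the chi-squared distribution with 1 degree of freedom; as a quick check:
pchisq(chiBP, df = 1, lower.tail = FALSE)   ## approximately 0.054, just above 0.05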
Problem 3: Exercise 3.17 Sales growth – Page 150 [45 Marks] A marketing researcher studied annual
sales of a product that had been introduced 10 years ago. The data are as follows, where X is the year
(coded) and Y is sales in thousands of units:
i:    1    2    3    4    5    6    7    8    9   10
Xi:   0    1    2    3    4    5    6    7    8    9
Yi:  98  135  162  178  221  232  283  300  374  395
a. [5 Marks] Prepare a scatter plot of the data. Does a linear relation appear adequate here?
Figure 3: Plot of the response against the predictor variable.
More or less, a linear relation appears to be reasonable for the two variables.
## R Codes ##
x <- c(0:9)
y <- c(98, 135, 162, 178, 221, 232, 283, 300, 374, 395)
plot(x, y, xlab = "Year", ylab = "Sales in thousands of units")
reg <- lm(y ~ x)
abline(reg)
b. [15 Marks] Use the Box-Cox procedure and standardization (3.36) to find an appropriate power transformation of Y . Evaluate SSE for λ = 0.3, 0.4, 0.5, 0.6, 0.7. What transformation of Y is suggested?
The Box-Cox procedure is performed using R (the codes are appended below). The value of λ = 0.5
minimized the residual sum of squares (SSE). Hence, a square root transformation of the original response
Y seems reasonable.
## R Codes ##
resSS <- function(x, y, lambda){
n <- length(y)
k2 <- (prod(y))^(1/n)
k1 <- 1/(lambda * (k2^(lambda - 1)))
w <- rep(NA, n)
for(i in 1:n){
w[i] <- ifelse(lambda == 0, (k2 * log(y[i])), (k1 * (y[i]^lambda - 1)))
}
reg_fit <- lm(w ~ x)
SSE <- deviance(reg_fit)
return(SSE)
}
lambda <- c(0.3, 0.4, 0.5, 0.6, 0.7)
## Figure 4: Plot of the SSE against λ.
SSE <- rep(NA, length(lambda))
for(i in 1:length(lambda)){
SSE[i] <- resSS(x, y, lambda[i])
}
plot(lambda, SSE, type = "l", lty = 1, lwd = 4, xlab = expression(lambda), ylab = "SSE")
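As a cross-check (a sketch, assuming the MASS package is available), boxcox() profiles the log-likelihood over λ, and its maximum should also fall near λ = 0.5:
library(MASS)
boxcox(lm(y ~ x), lambda = seq(0.3, 0.7, 0.01))   ## peak near lambda = 0.5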
c. [10 Marks] Use the transformation Y′ = √Y and obtain the estimated linear regression function for the transformed data.
The square root transformation is obtained using R (please see the codes below). The estimated regression
function for the transformed data is
Ŷ′i = 10.26 + 1.08Xi.
## R Codes ##
yp <- sqrt(y)
regt <- lm(yp ~ x)
summary(regt)
d. [5 Marks] Plot the estimated regression line and the transformed data. Does the regression line appear
to be a good fit to the transformed data?
The estimated regression line for the transformed data shows a better fit than the one obtained in part (a).
## R Codes ##
plot(x, yp, xlab = "Year", ylab = "Square root of sales")
abline(regt)
e. [5 Marks] Obtain the residuals and plot them against the fitted values. Also prepare a normal probability plot. What do your plots show?
Figure 5: Estimated regression line for the transformed data.
Figure 6: Plot of the residuals against fitted values for the transformed data.
From the plot of residuals against fitted values for the transformed data, we see that the values are nicely scattered around the zero line, suggesting that a linear fit for the transformed data is reasonable.
In the normal probability plot, the residuals do not fall along a straight line against the expected values, suggesting a weak to moderate departure of the residuals from normality.
## R Codes ##
## plot of residuals against fitted values ##
e <- residuals(regt)
yhat <- fitted(regt)
## Figure 7: Normal probability plot of the residuals for the transformed data.
plot(yhat, e, xlab = "Fitted Values", ylab = "Residuals", ylim = c(-3, 3))
abline(h = 0, lty = 2)
## normal-probability plot ##
n <- length(e)
MSE <- sum(e^2)/(n - 2)
RankofRes <- rank(e)
Zscore <- qnorm((RankofRes - 0.375)/(n + 0.25))
ExpRes <- Zscore * sqrt(MSE)
plot(ExpRes, e, xlab = "Expected Score", ylab = "Residuals", xlim = c(-0.6, 0.6), ylim = c(-0.6, 0.6))
abline(a = 0, b = 1)
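The built-in functions give an equivalent normal probability plot (a quick alternative sketch; the axis scaling differs from the hand-computed version above):
qqnorm(e)
qqline(e)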
f. [5 Marks] Express the estimated regression function in the original units.
In the original units, the estimated regression function is Ŷ = (10.26 + 1.08X)². Figure 8 shows the plot of this function with the original data; it is clearly a good fit.
## R Codes ##
yhat.ori <- (yhat)^2
x <- c(0:9)
y <- c(98, 135, 162, 178, 221, 232, 283, 300, 374, 395)
plot(x, y, xlab = "Year", ylab = "Sales in thousands of units")
points(x, yhat.ori, type = "l")
Problem 4: Exercises 4.6 – Page 172 [25 Marks] Refer to Muscle mass Problem 1.27.
a. [10 Marks] Obtain Bonferroni joint confidence intervals for β0 and β1 , using a 99 percent family
confidence coefficient. Interpret your confidence intervals.
Figure 8: Plot of the estimated regression function in original units.
The 100(1 − α)% Bonferroni joint confidence intervals for β0 and β1 are

b0 ± B s{b0} and b1 ± B s{b1},

where B = t(1 − α/4; n − 2).
Hence, the 99% Bonferroni joint confidence intervals for β0 and β1 are

(140.26, 172.43) for β0 and (−1.45, −0.93) for β1,

where b0 = 156.346564, b1 = −1.189996, s{b0} = 5.51226249, s{b1} = 0.09019725 and B = t(1 − α/4; n − 2) = 2.918394.
With family confidence coefficient 0.99, in repeated sampling of the same size, the two intervals will simultaneously cover the true values of β0 and β1 in 99% of the samples.
## R Codes ##
x <- MusMass[,2]
y <- MusMass[,1]
n <- length(x)
reg <- lm(y ~ x)
b <- coef(reg)
se <- sqrt(diag(vcov(reg)))
alpha <- 0.01
tval <- qt(1 - alpha/4, n - 2)
lower.lim <- b - tval * se
lower.lim
upper.lim <- b + tval * se
upper.lim
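Equivalently (a sketch), confint() reproduces the Bonferroni pair when each interval is computed at level 1 − α/2:
confint(reg, level = 1 - alpha/2)   ## each interval at 99.5% gives a 99% family confidence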
b. [10 Marks] Will b0 and b1 tend to err in the same direction or in opposite direction here? Explain.
The covariance between b0 and b1 can be expressed as

σ{b0, b1} = −X̄ σ²{b1}.
If X̄ is positive, b0 and b1 are negatively correlated, implying that if the estimate b1 is too high, the estimate
b0 is likely to be too low, and vice versa. In this example X̄ = 59.98; hence the covariance between b0 and
b1 is negative. This implies that the estimators b0 and b1 here tend to err in opposite directions.
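As a quick numerical sketch (using the fit reg and predictor x from part (a)), the sign can be verified from the off-diagonal element of the estimated variance-covariance matrix:
vcov(reg)[1, 2]              ## estimated covariance s{b0, b1}; negative here
-mean(x) * vcov(reg)[2, 2]   ## same quantity via the formula above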
c. [5 Marks] A researcher has suggested that β0 should be approximately 160 and that β1 should be between −1.9 and −1.5. Do the joint confidence intervals in part (a) support this expectation?
The suggested values for β1 are outside of the confidence limits obtained in part (a). So the joint
confidence intervals do not support this expectation.
Problem 5: Exercises 5.5 Consumer finance (page 210). The data below show, for a consumer finance company operating in six cities, the number of competing loan companies operating in the city (X) and the number per thousand of the company's loans made in that city that are currently delinquent (Y).
i:    1   2   3   4   5   6
Xi:   4   1   2   3   3   4
Yi:  16   5  10  15  13  22
Assume that first-order regression model (2.1) is applicable. Using matrix methods, find (1) Y′Y, (2) X′X, (3) X′Y. [3 × 5 = 15 Marks]
Here,

Y′Y = 1259,

X′X = [  6  17
        17  55 ],

X′Y = ( 81, 261 )′.
## R Codes ##
x <- c(4, 1, 2, 3, 3, 4)
y <- c(16, 5, 10, 15, 13, 22)
Y <- matrix(y, ncol = 1)
X <- cbind(1, x)
YY <- t(Y) %*% Y
YY
XX <- t(X) %*% X
XX
XY <- t(X) %*% Y
XY
Exercises 5.13 (page 211) Refer to Consumer finance Problem 5.5. Find (X′X)⁻¹. [10 Marks]
Here,

(X′X)⁻¹ = (1/41) [  55  −17
                   −17    6 ].
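A quick check in R (a sketch; XX is the matrix computed in Problem 5.5 above):
solve(XX)   ## equals (1/41) * matrix(c(55, -17, -17, 6), 2, 2)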
Exercises 5.24 (page 212) Refer to Consumer finance Problems 5.5 and 5.13. [60 Marks]
a. [7 × 5 = 35 Marks] Using matrix methods, obtain the following: (1) vector of estimated regression coefficients, (2) vector of residuals, (3) SSR, (4) SSE, (5) estimated variance-covariance matrix of b, (6) point estimate of E(Yh) when Xh = 4, (7) s²{pred} when Xh = 4.
Here, (1) the vector of estimated regression coefficients is

b = (X′X)⁻¹X′Y = ( 0.4390244, 4.6097561 )′.
(2) the vector of residuals is

e = Y − Xb = ( −2.87804878, −0.04878049, 0.34146341, 0.73170732, −1.26829268, 3.12195122 )′.
(3) the regression sum of squares is

SSR = b′X′Y − (1/n)Y′JY = 1238.707 − 1093.5 = 145.207.
(4) the error sum of squares is

SSE = e′e = (Y − Xb)′(Y − Xb) = 20.29268.
(5) the estimated variance-covariance matrix of b is

s²{b} = MSE (X′X)⁻¹ = [  6.8054730  −2.1035098
                        −2.1035098   0.7424152 ].
(6) the point estimate of E(Yh) when Xh = 4 is

Ê(Yh) = X′h b = 18.87805.
(7) s²{pred} when Xh = 4 is

s²{pred} = (1 + X′h (X′X)⁻¹ Xh) MSE = 6.929209.
b. [3 × 5 = 15 Marks] From your estimated variance-covariance matrix in part (a5), obtain the following: (1) s{b0, b1}; (2) s²{b0}; (3) s{b1}.
Here, (1) s{b0, b1} = −2.1035098; (2) s²{b0} = 6.805473; (3) s{b1} = √0.7424152 = 0.8616.
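As a sketch, these can be read off the matrix VarCovB computed in the R codes at the end of this problem:
VarCovB[1, 2]         ## s{b0, b1}
VarCovB[1, 1]         ## s2{b0}
sqrt(VarCovB[2, 2])   ## s{b1}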
c. [5 Marks] Find the hat matrix H.
Here,

H = X(X′X)⁻¹X′ =
[  0.36585366  −0.14634146   0.02439024   0.19512195   0.19512195   0.36585366 ]
[ −0.14634146   0.65853659   0.39024390   0.12195122   0.12195122  −0.14634146 ]
[  0.02439024   0.39024390   0.26829268   0.14634146   0.14634146   0.02439024 ]
[  0.19512195   0.12195122   0.14634146   0.17073171   0.17073171   0.19512195 ]
[  0.19512195   0.12195122   0.14634146   0.17073171   0.17073171   0.19512195 ]
[  0.36585366  −0.14634146   0.02439024   0.19512195   0.19512195   0.36585366 ].

d. [5 Marks] Find s²{e}.
Here,

s²{e} = (I − H) MSE =
[  3.2171327   0.7424152  −0.1237359  −0.9898870  −0.9898870  −1.8560381 ]
[  0.7424152   1.7323022  −1.9797739  −0.6186794  −0.6186794   0.7424152 ]
[ −0.1237359  −1.9797739   3.7120761  −0.7424152  −0.7424152  −0.1237359 ]
[ −0.9898870  −0.6186794  −0.7424152   4.2070196  −0.8661511  −0.9898870 ]
[ −0.9898870  −0.6186794  −0.7424152  −0.8661511   4.2070196  −0.9898870 ]
[ −1.8560381   0.7424152  −0.1237359  −0.9898870  −0.9898870   3.2171327 ].
## R Codes ##
b <- solve(XX) %*% XY
b
e <- Y - X %*% b
e
SSR <- t(b) %*% t(X) %*% Y - sum(y)^2 / length(y)   ## b'X'Y corrected for the mean
SSR
SSE <- t(e) %*% e
SSE
MSE <- SSE/4   ## df = n - 2 = 4
VarCovB <- solve(XX) * c(MSE)
VarCovB
Xh <- c(1, 4)
YhHat <- Xh %*% b
YhHat
s2YhPred <- (1 + t(Xh) %*% solve(XX) %*% Xh) * c(MSE)
s2YhPred
H <- X %*% solve(XX) %*% t(X)
H
I <- matrix(0, nrow = dim(H)[[1]], ncol = dim(H)[[1]])
diag(I) <- 1
s2e <- (I - H) * c(MSE)
s2e
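As a cross-check (a sketch), the built-in lm() fit reproduces the matrix results:
reg <- lm(y ~ x)
coef(reg)    ## matches b
vcov(reg)    ## matches VarCovB
anova(reg)   ## gives SSR = 145.207 and SSE = 20.293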
Problem 6: Exercises 6.15 Patient satisfaction (page 250) [65 Marks] A hospital administrator wished to study the relation between patient satisfaction (Y) and patient's age (X1, in years), severity of illness (X2, an index), and anxiety level (X3, an index). The administrator randomly selected 46 patients and collected the data presented below, where larger values of Y, X2, and X3 are, respectively, associated with more satisfaction, increased severity of illness, and more anxiety.
i:      1     2     3    ···    44    45    46
X1i:   50    36    40    ···    45    37    28
X2i:   51    46    48    ···    51    53    46
X3i:   2.3   2.3   2.2   ···    2.2   2.1   1.8
Yi:    48    57    66    ···    68    59    92
a. [5 Marks] Prepare a stem-and-leaf plot for each of the predictor variables. Are any noteworthy features
revealed by these plots?
The stem-and-leaf plots for X1, X2 and X3 are produced below. None of the plots reveals any noteworthy features: no outlying observations, no unusual distributional shapes, etc. Similar information could also be obtained from histograms.
2 | 23
2 | 58899999
3 | 012233344
3 | 66678
4 | 0012233344
4 | 557779
5 | 0233
5 | 55
Stem-and-leaf plot for X1, patient’s age.
40 | 0
42 | 00
44 | 0
46 | 00000
48 | 0000000000
50 | 000000000000
52 | 000000
54 | 0000
56 | 00
58 | 0
60 | 0
62 | 0
Stem-and-leaf plot for severity of illness (X2, an index).
18 | 000000
20 | 00000000
22 | 0000000000000
24 | 00000000000
26 | 0000
28 | 0000
Stem-and-leaf plot for anxiety level (X3, an index).
## R Codes ##
y <- PS[, 1]
x1 <- PS[, 2]
x2 <- PS[, 3]
x3 <- PS[, 4]
stem(x1)
stem(x2)
stem(x3)
b. [10 Marks] Obtain the scatter plot matrix and the correlation matrix. Interpret these and state your
principal findings.
Table 1: Correlation matrix of Y, X1, X2 and X3.

          Y        X1       X2       X3
Y     1.0000  -0.7868  -0.6029  -0.6446
X1   -0.7868   1.0000   0.5680   0.5697
X2   -0.6029   0.5680   1.0000   0.6705
X3   -0.6446   0.5697   0.6705   1.0000
Figure 9: Scatter plot matrix of Y, X1, X2 and X3.
Figure 9 and Table 1 show the scatter plot matrix and the correlation matrix of Y, X1, X2 and X3. The response variable Y is negatively correlated with each of the predictor variables X1, X2 and X3. The predictor variables X1, X2 and X3 are also moderately positively correlated with one another, which might introduce multicollinearity into the problem.
## R Codes ##
pairs(PS)
cor(PS)
c. [10 Marks] Fit regression model (6.5) for three predictor variables to the data and state the estimated
regression function. How is b2 interpreted here?
The estimated regression function is

Ŷ = 158.49 − 1.14X1 − 0.44X2 − 13.47X3.
Interpretation of b2 = −0.44: for a one-unit increase in severity of illness (X2), mean patient satisfaction (Y) decreases by 0.44 units, given that patient's age (X1) and anxiety level (X3) are held fixed.
## R Codes ##
attach(PS)
reg <- lm(y ~ x1 + x2 + x3, data = PS)
summary(reg)
d. [10 Marks] Obtain the residuals and prepare a box plot of the residuals. Do there appear to be any
outliers?
The box-plot of the residuals shows a symmetric distribution. No evidence of outliers is found.
## R Codes ##
e <- residuals(reg)
e
boxplot(e, ylab = "Residuals")
Table 2: Residuals

i:     1           2           3          ···    45           46
ei:    0.1129334  -9.0796538   4.0237858   ···   -5.5380448   10.0523698

Figure 10: Box-plot of the residuals.
e. [15 Marks] Plot the residuals against Ŷ , each of the predictor variables, and each two-factor interaction
term on separate graphs. Also prepare a normal probability plot. Interpret your plots and summarize
your findings.
The plots of the residuals against predictor variables do not show any striking pattern. The plot of the
residuals against fitted values shows that the constant variance assumption might not hold for this data set.
The normal probability plot shows that the residuals have longer tails in both directions.
## R Codes ##
yhat <- fitted(reg)
plot(yhat, e, xlab = "Fitted Values", ylab = "Residuals", ylim = c(-20, 20))
abline(h = 0, lty = 2)
plot(x1, e, xlab = "X1", ylab = "Residuals", ylim = c(-20, 20))
abline(h = 0, lty = 2)
plot(x2, e, xlab = "X2", ylab = "Residuals", ylim = c(-20, 20))
abline(h = 0, lty = 2)
plot(x3, e, xlab = "X3", ylab = "Residuals", ylim = c(-20, 20))
abline(h = 0, lty = 2)
install.packages("scatterplot3d")
library(scatterplot3d)
scatterplot3d(x1, x2, e, xlab = "X1", ylab = "X2", zlab = "Residuals")
## Figure 11: Plots of the residuals. (a) Residuals against Ŷ. (b) Residuals
## against X1. (c) Residuals against X2. (d) Residuals against X3.
## (e) Residuals against X1 and X2. (f) Residuals against X1 and X3.
## (g) Residuals against X2 and X3. (h) Normal probability plot of the residuals.
scatterplot3d(x1, x3, e, xlab = "X1", ylab = "X3", zlab = "Residuals")
scatterplot3d(x2, x3, e, xlab = "X2", ylab = "X3", zlab = "Residuals")
n <- length(e)
MSE <- sum(e^2)/(n - 4)
RankofRes <- rank(e)
Zscore <- qnorm((RankofRes - 0.375)/(n + 0.25))
ExpRes <- Zscore * sqrt(MSE)
plot(ExpRes, e, xlab = "Expected Score", ylab = "Residuals")
abline(a = 0, b = 1)
f. [5 Marks] Can you conduct a formal test for lack of fit here?
In this data set there are no repeat observations at common X levels, so a formal test for lack of fit cannot be conducted here.
g. [10 Marks] Conduct the Breusch-Pagan test for constancy of the error variance, assuming log σ²i = γ0 + γ1 Xi1 + γ2 Xi2 + γ3 Xi3; use α = 0.01. State the alternatives, decision rule, and conclusion.
Here, SSE and SSR∗ are 4248.841 and 21355.53, respectively. For n = 46, the test statistic is

χ²BP = (SSR∗/2)/(SSE/n)² = 1.25157,

which under H0 follows a chi-squared distribution with 3 degrees of freedom. At the α = 0.01 level of significance, the tabulated value is

χ²TAB = χ²(1 − α; 3) = 11.34487.

Conclusion: Since χ²BP < χ²TAB, the assumption of constant error variance cannot be rejected at the 0.01 level of significance.
## R Codes ##
SSE <- sum(e^2)
SSE
n <- length(e)
reg2 <- lm(e^2 ~ x1 + x2 + x3)
y2hat <- fitted(reg2)
SSR2 <- sum((y2hat - mean(y2hat))^2)
SSR2
chiBP <- (SSR2/2) / (SSE/n)^2
chiBP
chiTAB <- qchisq(0.99, 3)
chiTAB
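As a cross-check (a sketch, assuming the lmtest package is installed), bptest() with studentize = FALSE computes the same original Breusch-Pagan statistic:
## install.packages("lmtest")
library(lmtest)
bptest(reg, studentize = FALSE)   ## should match 1.25157 on 3 df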
Exercises 6.16 (page 251) Refer to Patient satisfaction Problem 6.15 [25 Marks] Assume that regression
model (6.5) for three predictor variables with independent normal error terms is appropriate.
a. [10 Marks] Test whether there is a regression relation; use α = 0.10. State the alternatives, decision
rule, and conclusion. What does your test imply about β1 , β2 , and β3 ? What is the P -value of the
test?
Consider the multiple linear regression model with three predictors

Yi = β0 + β1 Xi1 + β2 Xi2 + β3 Xi3 + εi.

In order to test whether there is any regression relation, we test the null hypothesis

H0: β1 = β2 = β3 = 0

against the alternative

HA: at least one of β1, β2, β3 is not equal to zero.
The test statistic is

FCAL = (SSR/3)/(SSE/(n − 4)) = 30.05,

which under H0 follows an F distribution with 3 and 42 degrees of freedom. At the α = 0.10 level of significance, the tabulated value is

FTAB = F(1 − α; 3, 42) = 2.219059.

Decision rule: As FCAL > FTAB, the null hypothesis is rejected at the α = 0.10 level of significance. That means at least one of the β's is not equal to zero. The P-value of the test is 1.542 × 10⁻¹⁰.
## R Codes ##
reg <- lm(y ~ x1 + x2 + x3)
summary(reg)
qf(0.90, 3, 42)
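The P-value can be computed directly from the F distribution (a quick sketch):
Fcal <- summary(reg)$fstatistic[1]
pf(Fcal, 3, 42, lower.tail = FALSE)   ## approximately 1.542e-10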
b. [10 Marks] Obtain joint interval estimates of β1 , β2 , and β3 , using a 90 percent family confidence
coefficient. Interpret your results.
The 100(1 − α)% Bonferroni joint confidence intervals for β1, β2 and β3 are

bk ± B s{bk}, k = 1, 2, 3,

where B = t(1 − α/6; n − 4).
Using R, the joint 90% confidence intervals are

(−1.614248, −0.6689755) for β1,
(−1.524510, 0.6405013) for β2,
(−29.092028, 2.1517012) for β3.

With family confidence coefficient 0.90, in repeated samples of size 46, the three intervals will simultaneously cover the true values of β1, β2 and β3 in 90% of the samples.
## R Codes ##
b <- summary(reg)$coef[, 1]
b
seb <- summary(reg)$coef[, 2]
seb
alpha <- 0.10
tval <- qt(1 - alpha/6, 42)
lower.lim <- b - tval * seb
lower.lim
upper.lim <- b + tval * seb
upper.lim
c. [5 Marks] Calculate the coefficient of multiple determination. What does it indicate here?
The coefficient of multiple determination is computed as

R² = SSR/SST = 0.6822.

About 68.22% of the variability in the response variable, patient satisfaction, is explained by the fitted model.
## R Codes ##
R2 <- summary(reg)$r.squared
R2
Exercises 6.17 (page 251) Refer to Patient satisfaction Problem 6.15 [20 Marks] Assume that regression
model (6.5) for three predictor variables with independent normal error terms is appropriate.
a. [10 Marks] Obtain an interval estimate of the mean satisfaction when Xh1 = 35, Xh2 = 45, and
Xh3 = 2.2. Use a 90 percent confidence coefficient. Interpret your confidence interval.
Here, the estimated mean response at Xh is

Ê(Yh) = X′h b = 69.01029.

The estimated variance of this estimate is

s²{Ŷh} = MSE × X′h (X′X)⁻¹ Xh = 7.100156.

Hence, the 100(1 − α)% confidence interval for E(Yh) is

Ê(Yh) ± t(1 − α/2; n − 4) s{Ŷh} = (64.52854, 73.49204).

In repeated sampling with n = 46, intervals constructed in this way will contain the true mean satisfaction E(Yh) 90 percent of the time.
## R Codes ##
X <- cbind(1, x1, x2, x3)
Y <- matrix(y, ncol = 1)
XX <- t(X) %*% X
XY <- t(X) %*% Y
b <- solve(XX) %*% XY
Xh <- matrix(c(1, 35, 45, 2.2), nrow = 1)
EYh <- Xh %*% b
EYh
VarEYh <- (Xh %*% solve(XX) %*% t(Xh)) * MSE
VarEYh
alpha <- 0.10
tval <- qt(1 - alpha/2, 42)
c(EYh - tval * sqrt(VarEYh), EYh + tval * sqrt(VarEYh))
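Equivalently (a sketch, using the lm() fit reg from Problem 6.15):
predict(reg, newdata = data.frame(x1 = 35, x2 = 45, x3 = 2.2),
interval = "confidence", level = 0.90)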
b. [10 Marks] Obtain a prediction interval for a new patient’s satisfaction when Xh1 = 35, Xh2 = 45, and
Xh3 = 2.2. Use a 90 percent confidence coefficient. Interpret your confidence interval.
Here, the predicted new observation is

Ŷh = X′h b = 69.01029.

The estimated variance of the prediction error is

s²{pred} = MSE × (1 + X′h (X′X)⁻¹ Xh) = 108.263.

Hence, the 100(1 − α)% prediction interval for Yh(new) is

Ŷh ± t(1 − α/2; n − 4) s{pred} = (51.50965, 86.51092).

With 90 percent confidence, a new patient's satisfaction at these levels of the predictors will fall in (51.50965, 86.51092); in repeated sampling, intervals constructed this way will contain the new observation 90 percent of the time.
## R Codes ##
pred <- Xh %*% b
pred
VarPred <- (1 + (Xh %*% solve(XX) %*% t(Xh))) * MSE
VarPred
c(pred - tval * sqrt(VarPred), pred + tval * sqrt(VarPred))
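The same interval can be obtained via predict() (a sketch, using the lm() fit reg from Problem 6.15):
predict(reg, newdata = data.frame(x1 = 35, x2 = 45, x3 = 2.2),
interval = "prediction", level = 0.90)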