ACMS 30600
Homework #4 Solutions
Total Points Possible: 2+14+4+6+4+14=44
More feedback on common errors and typical deductions is provided in a separate document.
I.
For the saving data, compute the variance inflation factor (VIF) for pop75. Paste the R
commands you used to do this (about 2 lines of code should be enough). Assuming this is
the largest VIF for any of the predictors, would you conclude there is significant
multicollinearity present in the model? [2 points]
mod.pop75 <- lm(pop75 ~ pop15 + dpi + ddpi)
summary(mod.pop75)$r.squared   # R² from regressing pop75 on the remaining predictors
The R² is 0.8492, so the VIF is 1/(1 - 0.8492) = 6.6313. Since this is not greater than the rule-of-thumb threshold of 10, we conclude there is not a severe multicollinearity problem.
2 points: 1 point for R code & correct VIF; 1 point for correct conclusion on lack of multicollinearity
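As a cross-check, all of the VIFs can be computed at once with the car package. This is only a sketch: the data frame name savings and the response name sr follow the faraway package's version of the saving data and are assumptions here, not part of the assignment.
# Sketch: assumes the saving data are loaded as a data frame `savings`
# (response sr, as in the faraway package) and that car is installed.
library(car)
mod.full <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data=savings)
vif(mod.full)   # one VIF per predictor; the pop75 entry should match the 6.6313 above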
II.
PCA Task Solutions [Total = 14 points]
1. 2 points
> X.scale<-scale(cbind(gro,im,tfr,gdp))
> pca<-prcomp(X.scale)
2. 2 points
         PC1
gro   0.496660
im    0.530386
tfr   0.536386
gdp  -0.429310
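These loadings can also be read directly off the fitted object from part 1:
> pca$rotation[,1]   # PC1 loadings for gro, im, tfr, gdp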
3. 1 point. Report the computed value; the calculation itself need not be shown, since it was not covered in lecture.
> sum(0.49666*X.scale[1,1] + 0.530386*X.scale[1,2] + 0.536386*X.scale[1,3] - 0.42931*X.scale[1,4])
[1] -0.2122266
Note: A good number of students skipped this problem, saying the professor had said to skip it, but since over half the class either did the full problem or did it as directed, it counted as 1 point.
Showing similar work, a formula, or a verbal explanation is also acceptable.
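The same score can be obtained without retyping the loadings (a sketch, assuming the X.scale and pca objects from part 1):
> sum(X.scale[1,] * pca$rotation[,1])   # dot product of the first scaled row with the PC1 loadings
> pca$x[1,1]                            # the stored PC1 score for the first observation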
4. 2 points
> screeplot(pca, type="lines")
The elbow in the plot appears to be at about PC2, so we include the first two PCs in the model. Some might argue for including the first three PCs; that is also justifiable.
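The elbow judgment can also be checked numerically via the proportion of variance explained by each PC:
> summary(pca)                 # the "Proportion of Variance" row gives each PC's share
> pca$sdev^2/sum(pca$sdev^2)   # the same proportions computed by hand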
5. 2 points
> mod2<-lm(life~pca$x[,1]+pca$x[,2])
6. 2 points
The R² and adjusted R² of the model with two PCs are slightly lower than those of the model with all four original predictors.
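A side-by-side comparison (a sketch, assuming the original variables life, gro, im, tfr, and gdp are still in the workspace, with mod2 from part 5):
> mod.full <- lm(life ~ gro + im + tfr + gdp)
> summary(mod.full)$r.squared; summary(mod.full)$adj.r.squared
> summary(mod2)$r.squared; summary(mod2)$adj.r.squared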
7. 3 points; 1 for correlation, 2 for explanation
> cor(pca$x[,1],pca$x[,2])
[1] 9.055709e-17
The correlation between PCs computed from a given set of x variables is always exactly zero (R computes them numerically, so it reports a tiny number very close to zero). Using the PCs in the model instead of the original x's eliminates any effects of multicollinearity, since the PCs are uncorrelated. Thus, the results from a model using PCs may be easier to interpret than those from a model using the original x's when those x's are correlated.
Note: A lot of students obtained a different, equally tiny correlation value; full credit was given to those students.
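The same fact holds for every pair of PCs, which can be checked in one call:
> round(cor(pca$x), 12)   # off-diagonal entries are zero up to rounding error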
III. Durbin-Watson R Activity [4 points]
1. 4 points for code and correct answer
my.DW <- function(x, y){
  my.n <- length(x)
  my.lm <- lm(y ~ x)
  my.SSE <- sum(my.lm$residuals^2)
  # Numerator: sum of squared successive differences of the residuals
  my.numerator <- sum((my.lm$residuals[2:my.n] - my.lm$residuals[1:(my.n-1)])^2)
  my.d <- my.numerator/my.SSE
  return(my.d)
}
2.
0.3719777. You should obtain this exact value; otherwise, the code was not tested appropriately.
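As an optional sanity check (assuming the homework's x and y vectors are in the workspace and the lmtest package is installed), the result can be compared against an established implementation:
> my.DW(x, y)     # should print 0.3719777
> library(lmtest)
> dwtest(y ~ x)   # the reported DW statistic should match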
IV. Statistical Inference on a Contingency Table [6 points]
1. 2 points
θ̂ = (10*448)/(87*255) = 0.2019382
Or,
θ̂ = (87*255)/(10*448) = 4.952009
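For reference, a sketch that reconstructs the 2x2 table implied by these counts and computes the first odds ratio:
# Rows: SES (high, low); columns: delinquent (yes, no).
# Counts are inferred from the products above and the group totals in problem V.
tab <- matrix(c(10, 87, 255, 448), nrow=2,
              dimnames=list(SES=c("high","low"), delinquent=c("yes","no")))
(tab[1,1]*tab[2,2])/(tab[1,2]*tab[2,1])   # 0.2019382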
2. 2 points
> log(0.2019382) - qnorm(0.975,0,1)*sqrt((1/10)+(1/255)+(1/87)+(1/448))
[1] -2.272058
> log(0.2019382) + qnorm(0.975,0,1)*sqrt((1/10)+(1/255)+(1/87)+(1/448))
[1] -0.9275289
Note: This is natural log, not base-10 log. Apologies for any
confusion.
Then, taking the exponent of each bound, we obtain
> exp(-2.272058);exp(-0.9275289)
[1] 0.1030998
[1] 0.3955299
Or, equivalently, using the second odds ratio from part 1,
> log(4.952009)-qnorm(0.975,0,1)* sqrt((1/10)+(1/255)+(1/87)+(1/448))
[1] 0.9275287
> log(4.952009)+qnorm(0.975,0,1)* sqrt((1/10)+(1/255)+(1/87)+(1/448))
[1] 2.272058
> exp(0.9275287);exp(2.272058)
[1] 2.528253
[1] 9.699342
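The whole calculation can be wrapped in a small helper function (a sketch; or.ci is a hypothetical name, not part of the assignment):
# Wald CI for the odds ratio (n11*n22)/(n12*n21), returned on the odds scale
or.ci <- function(n11, n12, n21, n22, level=0.95){
  lo <- log((n11*n22)/(n12*n21))
  se <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)
  z <- qnorm(1 - (1-level)/2)
  exp(c(lo - z*se, lo + z*se))
}
or.ci(10, 255, 87, 448)   # 0.1030998 0.3955299, matching the interval above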
3. 2 points
Since the intervals do not include 1 (or for the interval prior to exponentiation, 0), we conclude SES and
delinquency are associated.
V. Fit a logistic regression for the SES data, and interpret the coefficients. Delinquent status is the
response. [4 points; 2 for correct fitting of model, 2 for correct coefficient interpretation]
This is very similar to the heart attack example in the class notes. The AZT example is not the right point of reference, since that example has four groupings with the yes/no rather than two.
######### SES data for logistic regression (same structure as the heart attack example)
## Convert the contingency table counts to variables in vector form
SES <- c(0,1)        # 0 = low SES, 1 = high SES
Delinq <- c(87,10)   # number delinquent in each SES group
n <- c(535,265)      # group sizes
#Fit a logistic regression model using the glm() function
mod.ses<-glm((Delinq/n)~SES,family=binomial,weights=n)
summary(mod.ses)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.6389     0.1172 -13.988  < 2e-16 ***
SES          -1.5998     0.3430  -4.664  3.1e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3.0830e+01 on 1 degrees of freedom
Residual deviance: 1.7208e-13 on 0 degrees of freedom
AIC: 14.247
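Equivalently (a sketch), the same model can be fit with the (successes, failures) response form of glm(), which avoids the weights argument:
mod.ses2 <- glm(cbind(Delinq, n - Delinq) ~ SES, family=binomial)
coef(mod.ses2)   # matches the coefficients in the summary above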
Interpretation of SES Coefficient: Being from high SES is associated with a decrease of 1.5998 in the log odds of delinquency. The odds ratio of delinquency for high vs. low SES is exp(-1.5998).
Note: Points were deducted for the use of terms such as “factor” or “probability.” Odds and log odds are both distinct from probability, and correct use of one of those terms was necessary for full credit. For example, high SES does not produce a 1.5998-fold decrease in probability; the multiplicative change is closer to exp(1.5998) [converting from the log scale to the level scale], but even that is not completely precise. The best bet is to stay in log odds or odds.
VI. Titanic Data (14 points)
1.
gender <- c(1,1,0,0)   # 1 = male, 0 = female
class <- c(1,0,1,0)    # 1 = first class, 0 = lower class
survive <- c(57,89,140,156)                    # survivors in each group
n.i <- c((57+118),(89+541),(140+4),(156+102))  # group sizes
mod1 <- glm(survive/n.i ~ gender + class, family=binomial, weights=n.i)
2. 4 points
The log odds of survival are 2.5740 lower for men than women. The log odds of survival are 1.5158
higher for upper class than lower class.
Or, the odds ratio of survival for men vs. women is exp(-2.5740); the odds ratio of survival for upper vs.
lower class is exp(1.5158).
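These odds ratios can be computed directly from the fitted model:
> exp(coef(mod1))   # exponentiated intercept and coefficients (odds scale)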
3. 4 points
1st Class Male:
exp(0.6125 - 2.5740*1 + 1.5158*1)/(1 + exp(0.6125 - 2.5740*1 + 1.5158*1))
1st Class Female:
exp(0.6125 - 2.5740*0 + 1.5158*1)/(1 + exp(0.6125 - 2.5740*0 + 1.5158*1))
Lower Class Male:
exp(0.6125 - 2.5740*1 + 1.5158*0)/(1 + exp(0.6125 - 2.5740*1 + 1.5158*0))
Lower Class Female:
exp(0.6125 - 2.5740*0 + 1.5158*0)/(1 + exp(0.6125 - 2.5740*0 + 1.5158*0))
Or, access these values using mod1$fitted.values or the predict() function.
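For example, a sketch using predict() with the four gender/class combinations in the order listed above:
new.data <- data.frame(gender=c(1,0,1,0), class=c(1,1,0,0))
predict(mod1, newdata=new.data, type="response")   # fitted survival probabilities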
Note: Only 1 of the 4 points was given for the correct values if no work or supporting code was shown.
4.
1 - sum((survive/n.i - mod1$fitted.values)^2)/sum((survive/n.i - mean(survive/n.i))^2)
[1] 0.9678833
5.
pchisq(mod1$deviance,mod1$df.residual,lower=F)
[1] 7.917697e-06
The p-value is less than 0.05, so we reject the null hypothesis and conclude that this model is not a good fit to the data.
Note: The null hypothesis of this test is that the model fits the data well; many students reached the reverse conclusion.