ACMS 30600 Homework #4 Solutions

Total Points Possible: 2 + 14 + 4 + 6 + 4 + 14 = 44

More feedback on common errors and typical deductions is provided in a separate document.

I. For the savings data, compute the variance inflation factor (VIF) for pop75. Paste the R commands you used to do this (about 2 lines of code should be enough). Assuming this is the largest VIF for any of the predictors, would you conclude there is significant multicollinearity present in the model? [2 points]

mod.pop75 <- lm(pop75 ~ pop15 + dpi + ddpi)

R^2 is 0.8492, so the VIF is 1/(1 - 0.8492) = 6.6313. Since this is not greater than the rule-of-thumb threshold of 10, we conclude there is not a severe multicollinearity problem.

2 points: 1 point for R code and correct VIF; 1 point for the correct conclusion about the lack of multicollinearity.

II. PCA Task Solutions [Total = 14 points]

1. (2 points)
> X.scale <- scale(cbind(gro, im, tfr, gdp))
> pca <- prcomp(X.scale)

2. (2 points)
         gro        im       tfr       gdp
PC1  0.49666  0.530386  0.536386  -0.42931

3. (1 point) Report the computed loading score; the calculation need not be shown, since it was not covered in lecture.
> sum(0.49666*X.scale[1,1] + 0.530386*X.scale[1,2] + 0.536386*X.scale[1,3] - 0.42931*X.scale[1,4])
[1] -0.2122266
Note: A good number of students skipped this problem, saying the professor said to skip it, but since over half the class either did the full problem or did it as directed, it counted as 1 point. Showing similar work, a formula, or a verbal explanation is also acceptable.

4. (2 points)
> screeplot(pca, type="lines")
The elbow in the plot appears to be at about PC2, so we include the first two PCs in the model. Including the first three PCs is also justifiable.

5. (2 points)
> mod2 <- lm(life ~ pca$x[,1] + pca$x[,2])

6. (2 points)
The R^2 and adjusted R^2 of the model with two PCs are slightly lower than those of the model with all four original predictors.
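The workflow in parts 1-5 can be sketched end to end. This is a minimal, self-contained illustration using simulated data in place of the course dataset (which is not reproduced here); the variable names gro, im, tfr, gdp, and life simply mirror the ones above.

```r
# Sketch of the PCA-regression workflow, on simulated stand-in data.
set.seed(1)
gro <- rnorm(50); im <- rnorm(50); tfr <- rnorm(50); gdp <- rnorm(50)
life <- 70 + gro - tfr + rnorm(50)                # stand-in response

X.scale <- scale(cbind(gro, im, tfr, gdp))        # standardize before PCA
pca <- prcomp(X.scale)

pca$rotation[, 1]                                 # PC1 loadings (part 2)
sum(pca$rotation[, 1] * X.scale[1, ])             # PC1 score of obs. 1 (part 3);
                                                  # equals pca$x[1, 1]
screeplot(pca, type = "lines")                    # elbow plot (part 4)
mod2 <- lm(life ~ pca$x[, 1] + pca$x[, 2])        # regress on first two PCs (part 5)
```

The manual dot product in part 3 reproduces the first entry of pca$x, which is why showing either the code or the formula earned the point.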
7. (3 points: 1 for correlation, 2 for explanation)
> cor(pca$x[,1], pca$x[,2])
[1] 9.055709e-17
The correlation between PCs calculated from a given set of x variables is always zero (R computes them numerically, so it gives a tiny number very close to zero). Using PCs in the model instead of the original x's eliminates any effects of multicollinearity, since the PCs are uncorrelated. Thus, the results from a model using PCs may be easier to interpret than a model using the original x's when those x's are correlated.
Note: Many students reported a different, similarly tiny correlation value; full credit was given to those students.

III. Durbin-Watson R Activity [4 points]

1. (4 points for code and a correct answer)
my.DW <- function(x, y) {
  my.n <- length(x)
  my.lm <- lm(y ~ x)
  my.SSE <- sum(my.lm$residuals^2)
  my.numerator <- sum((my.lm$residuals[2:my.n] - my.lm$residuals[1:(my.n-1)])^2)
  my.d <- my.numerator / my.SSE
  my.d
}

2. 0.3719777. You should obtain this exact value; otherwise the code was not appropriately tested.

IV. Statistical Inference on a Contingency Table [6 points]

1. (2 points)
theta-hat = (10*448)/(87*255) = 0.2019382
or, equivalently,
theta-hat = (87*255)/(10*448) = 4.952009

2. (2 points)
> log(0.2019382) - qnorm(0.975,0,1)*sqrt((1/10)+(1/255)+(1/87)+(1/448))
[1] -2.272058
> log(0.2019382) + qnorm(0.975,0,1)*sqrt((1/10)+(1/255)+(1/87)+(1/448))
[1] -0.9275289
Note: This is the natural log, not the base-10 log. Apologies for any confusion.
Then, exponentiating each bound, we obtain
> exp(-2.272058); exp(-0.9275289)
[1] 0.1030998
[1] 0.3955299
Or, equivalently, using the second odds ratio from part 1,
> log(4.952009) - qnorm(0.975,0,1)*sqrt((1/10)+(1/255)+(1/87)+(1/448))
[1] 0.9275287
> log(4.952009) + qnorm(0.975,0,1)*sqrt((1/10)+(1/255)+(1/87)+(1/448))
[1] 2.272058
> exp(0.9275287); exp(2.272058)
[1] 2.528253
[1] 9.699342

3. (2 points)
Since the intervals do not include 1 (or, for the interval prior to exponentiation, 0), we conclude SES and delinquency are associated.
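The interval in part 2 can be reproduced as one short, self-contained computation from the four cell counts used above (10, 255, 87, 448):

```r
# Sketch: Wald confidence interval for the odds ratio from the 2x2 counts.
counts <- c(10, 255, 87, 448)                 # the four cells of the table
theta.hat <- (10 * 448) / (87 * 255)          # estimated odds ratio, ~0.2019
se.log <- sqrt(sum(1 / counts))               # SE of the log odds ratio
ci.log <- log(theta.hat) + c(-1, 1) * qnorm(0.975) * se.log
exp(ci.log)                                   # ~ (0.103, 0.396); excludes 1
```

Because the interval is built on the log scale and exponentiated, "excludes 1" on the odds-ratio scale is the same statement as "excludes 0" on the log scale.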
V. Fit a logistic regression for the SES data, and interpret the coefficients. Delinquent status is the response. [4 points: 2 for correct fitting of the model, 2 for correct coefficient interpretation]

This is very similar to the heart attack example in the class notes. The AZT example should not be the point of reference, since that example has four groupings with the yes/no rather than two.

######### Heart Attack data for Logistic Regression
## Code converting the contingency table counts to variables in vector form
SES <- c(0, 1)
Delinq <- c(87, 10)
n <- c(535, 265)
## Fit a logistic regression model using the glm() function
mod.ses <- glm((Delinq/n) ~ SES, family = binomial, weights = n)
summary(mod.ses)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.6389     0.1172 -13.988  < 2e-16 ***
SES          -1.5998     0.3430  -4.664  3.1e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 3.0830e+01 on 1 degrees of freedom
Residual deviance: 1.7208e-13 on 0 degrees of freedom
AIC: 14.247

Interpretation of the SES coefficient: Being from high SES is associated with a decrease of 1.5998 in the log odds of delinquency. The odds ratio of delinquency for high SES vs. low SES is exp(-1.5998).

Note: Points were deducted for the use of terms such as "factor" or "probability." Odds and log odds are both distinct from probability; correct use of one of those terms was necessary for full credit. For example, high SES does not produce a 1.5998-fold decrease in probability; the odds are multiplied by exp(-1.5998) [converting from the log scale to the level scale], and even that is a statement about odds, not probability. The safest bet is to stay in log odds or odds.

VI. Titanic Data [14 points]

1.
gender <- c(1, 1, 0, 0)
class <- c(1, 0, 1, 0)
survive <- c(57, 89, 140, 156)
n.i <- c((57+118), (89+541), (140+4), (156+102))
mod1 <- glm(survive/n.i ~ gender + class, family = binomial, weights = n.i)

2. (4 points)
The log odds of survival are 2.5740 lower for men than for women. The log odds of survival are 1.5158 higher for upper class than for lower class. Or: the odds ratio of survival for men vs. women is exp(-2.5740); the odds ratio of survival for upper vs. lower class is exp(1.5158).

3. (4 points)
1st class male:     exp(0.6125 - 2.5740*1 + 1.5158*1)/(1 + exp(0.6125 - 2.5740*1 + 1.5158*1))
1st class female:   exp(0.6125 - 2.5740*0 + 1.5158*1)/(1 + exp(0.6125 - 2.5740*0 + 1.5158*1))
Lower class male:   exp(0.6125 - 2.5740*1 + 1.5158*0)/(1 + exp(0.6125 - 2.5740*1 + 1.5158*0))
Lower class female: exp(0.6125 - 2.5740*0 + 1.5158*0)/(1 + exp(0.6125 - 2.5740*0 + 1.5158*0))
Or, access these values using mod1$fitted.values or the predict() function.
Note: Only 1 of the 4 points was given for correct values if no work or supporting code was shown.

4.
> 1 - sum((survive/n.i - mod1$fitted.values)^2)/sum((survive/n.i - mean(survive/n.i))^2)
[1] 0.9678833

5.
> pchisq(mod1$deviance, mod1$df.residual, lower = F)
[1] 7.917697e-06
The p-value is less than 0.05, so we reject the null hypothesis and conclude this model is not a good fit to the data.
Note: The null hypothesis of this test is that the model fits the data well; many students reached the reverse conclusion.
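The four expressions in part 3 all apply the inverse logit to a linear predictor, so they can be computed in one pass. This is a sketch using the estimated coefficients reported above; inv.logit is a local helper defined here, not a package function.

```r
# Sketch: the four fitted survival probabilities in VI part 3.
inv.logit <- function(eta) exp(eta) / (1 + exp(eta))   # logistic CDF
b0 <- 0.6125; b.gender <- -2.5740; b.class <- 1.5158   # coefficients from mod1

# gender = 1 is male, class = 1 is upper class, matching the coding in part 1.
grid <- expand.grid(gender = c(1, 0), class = c(1, 0))
grid$p.survive <- inv.logit(b0 + b.gender * grid$gender + b.class * grid$class)
grid
```

Upper-class females get the highest fitted survival probability and lower-class males the lowest, which matches the signs of the two coefficients.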