eNote 4

PCR, Principal Component Regression in R

Contents

4 PCR, Principal Component Regression in R
    4.1 Reading material
    4.2 Presentation material
        4.2.1 Motivating Example
        4.2.2 Example: Spectral type data
        4.2.3 Some more presentation stuff
    4.3 Example: Car Data (again)
    4.4 Exercises

4.1 Reading material

PCR and the other biased regression methods presented in this course (PLS, Ridge and Lasso) are, together with further methods (e.g. MLR = OLS), introduced in each of the following three books:

• The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, February 2009. Trevor Hastie, Robert Tibshirani, Jerome Friedman.

• Ron Wehrens (2012). Chemometrics With R: Multivariate Data Analysis in the Natural Sciences and Life Sciences. Springer, Heidelberg. (Chapters 8 and 9)

• K. Varmuza and P. Filzmoser (2009). Introduction to Multivariate Statistical Analysis in Chemometrics. CRC Press. (Chapter 4)

The latter two are directly linked with R packages, and here we will mainly use the last one. We give here a reading list for the most relevant parts of Chapter 4 of the Varmuza and Filzmoser book, when it comes to syllabus content for course 27411:

• Section 4.1 (4 pp) (Concepts - ALL models)
• Section 4.2.1-4.2.3 (6 pp) (Errors in (ALL) models)
• Section 4.2.5-4.2.6 (3.5 pp) (CV + bootstrap - ALL models)
• [Section 4.3.1-4.3.2.1 (9.5 pp) (Simple regression and MLR (= OLS))]
• [Section 4.5.3 (1 p) (Stepwise variable selection in MLR)]
• Section 4.6 (2 pp) (PCR)
• Section 4.7.1 (3.5 pp) (PLS)
• Section 4.8.2 (2.5 pp) (Ridge and Lasso)
• Section 4.9-4.9.1.2 (5 pp) (An example using PCR and PLS)
• Section 4.9.1.4-4.9.1.5 (5 pp) (An example using Ridge and Lasso)
• Section 4.10 (2 pp) (Summary)

4.2 Presentation material

What is PCR? (PCR = PCA + MLR)

• NOT: Polymerase Chain Reaction
• A regression technique to cope with many x-variables
• Situation: given Y- and X-data
• Do PCA on the X-matrix
  - This defines new variables: the principal components (scores)
• Use some of these new variables in an MLR to model/predict Y
• Y may be univariate OR multivariate - in this course: only UNIVARIATE.

4.2.1 Motivating Example

# Simulation of data:
set.seed(123)
y  <- 1:7 + rnorm(7, sd = 0.2)
x1 <- 1:7 + rnorm(7, sd = 0.2)
x2 <- 1:7 + rnorm(7, sd = 0.2)
x3 <- 1:7 + rnorm(7, sd = 0.2)
x4 <- 1:7 + rnorm(7, sd = 0.2)
data1 <- matrix(c(x1, x2, x3, x4, y), ncol = 5, byrow = F)

Multiple linear regression and stepwise removal of variables, manually:

# For data1: (the right order will change depending on the simulation)
res <- lm(y ~ x1 + x2 + x3 + x4)
summary(res)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4)

Residuals:
       1        2        3        4        5        6        7
-0.09790 -0.06970  0.25359  0.00534 -0.06863  0.05346 -0.07617

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.007944   0.264401  -0.030    0.979
x1           0.397394   0.715631   0.555    0.635
x2           0.688683   1.336202   0.515    0.658
x3           0.583640   0.463201   1.260    0.335
x4          -0.612946   1.992878  -0.308    0.787

Residual standard error: 0.2146 on 2 degrees of freedom
Multiple R-squared: 0.997,  Adjusted R-squared: 0.9909
F-statistic: 164.4 on 4 and 2 DF,  p-value: 0.006055

res <- lm(y ~ x2 + x3 + x4)
summary(res)

Call:
lm(formula = y ~ x2 + x3 + x4)

Residuals:
        1         2         3         4         5         6         7
-0.080466 -0.139415  0.249587  0.036931  0.008603  0.045685 -0.120925

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.05461    0.20983   0.260    0.812
x2           0.15563    0.81535   0.191    0.861
x3           0.59238    0.40608   1.459    0.241
x4           0.27043    1.05296   0.257    0.814

Residual standard error: 0.1883 on 3 degrees of freedom
Multiple R-squared: 0.9965,  Adjusted R-squared: 0.993
F-statistic: 284.8 on 3 and 3 DF,  p-value: 0.000351

res <- lm(y ~ x2 + x3)
summary(res)

Call:
lm(formula = y ~ x2 + x3)

Residuals:
        1         2         3         4         5         6         7
-0.102758 -0.128568  0.245607  0.074246  0.005447  0.028251 -0.122225

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.02487    0.15319   0.162   0.8789
x2           0.35574    0.21039   1.691   0.1661
x3           0.67922    0.19692   3.449   0.0261 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1648 on 4 degrees of freedom
Multiple R-squared: 0.9964,  Adjusted R-squared: 0.9946
F-statistic: 557.2 on 2 and 4 DF,  p-value: 1.279e-05

res <- lm(y ~ x3)
summary(res)

Call:
lm(formula = y ~ x3)

Residuals:
        1         2         3         4         5         6         7
-0.228160 -0.007391  0.282249 -0.044558  0.173049 -0.027071 -0.148118

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.15179    0.15641   0.971    0.376
x3           1.00823    0.03542  28.466    1e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1931 on 5 degrees of freedom
Multiple R-squared: 0.9939,  Adjusted R-squared: 0.9926
F-statistic: 810.3 on 1 and 5 DF,  p-value: 1.002e-06

The pair-wise relations:

pairs(matrix(c(x1, x2, x3, x4, y), ncol = 5, byrow = F),
      labels = c("x1", "x2", "x3", "x4", "y"))

[Figure: pairwise scatter plots of x1, x2, x3, x4 and y for data1 - all pairs are strongly correlated]

If this is repeated:

# For data2:
y  <- 1:7 + rnorm(7, sd = 0.2)
x1 <- 1:7 + rnorm(7, sd = 0.2)
x2 <- 1:7 + rnorm(7, sd = 0.2)
x3 <- 1:7 + rnorm(7, sd = 0.2)
x4 <- 1:7 + rnorm(7, sd = 0.2)
data2 <- matrix(c(x1, x2, x3, x4, y), ncol = 5, byrow = F)

res <- lm(y ~ x1 + x2 + x3 + x4)
summary(res)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4)

Residuals:
        1         2         3         4         5         6         7
 0.018885  0.038602 -0.077600 -0.007769 -0.023867  0.064100 -0.012351

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.31443    0.08432   3.729    0.065 .
x1           0.07037    0.15585   0.451    0.696
x2           0.44155    0.23938   1.845    0.206
x3          -0.05506    0.24668  -0.223    0.844
x4           0.44666    0.20606   2.168    0.162
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07987 on 2 degrees of freedom
Multiple R-squared: 0.9995,  Adjusted R-squared: 0.9985
F-statistic: 1013 on 4 and 2 DF,  p-value: 0.0009866

res <- lm(y ~ x1 + x2 + x4)
summary(res)

Call:
lm(formula = y ~ x1 + x2 + x4)

Residuals:
        1         2         3         4         5         6         7
 0.027597  0.036597 -0.077031 -0.017868 -0.029246  0.062158 -0.002207

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.31370    0.06965   4.504   0.0204 *
x1           0.05180    0.10895   0.475   0.6669
x2           0.41956    0.18034   2.327   0.1025
x4           0.43347    0.16318   2.656   0.0766 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06602 on 3 degrees of freedom
Multiple R-squared: 0.9995,  Adjusted R-squared: 0.999
F-statistic: 1976 on 3 and 3 DF,  p-value: 1.93e-05

res <- lm(y ~ x2 + x4)
summary(res)

Call:
lm(formula = y ~ x2 + x4)

Residuals:
         1          2          3          4          5          6          7
 0.0113484  0.0570135 -0.0664179 -0.0287655 -0.0372574  0.0636935  0.0003855

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.32388    0.05952   5.441  0.00554 **
x2           0.45131    0.15044   3.000  0.03995 *
x4           0.45049    0.14298   3.151  0.03449 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.05929 on 4 degrees of freedom
Multiple R-squared: 0.9995,  Adjusted R-squared: 0.9992
F-statistic: 3676 on 2 and 4 DF,  p-value: 2.958e-07

res <- lm(y ~ x2)
summary(res)

Call:
lm(formula = y ~ x2)

Residuals:
        1         2         3         4         5         6         7
 0.009913 -0.003355  0.001475  0.031214 -0.168653  0.139073 -0.009667

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.2191     0.0824   2.659   0.0449 *
x2            0.9241     0.0180  51.338  5.3e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09896 on 5 degrees of freedom
Multiple R-squared: 0.9981,  Adjusted R-squared: 0.9977
F-statistic: 2636 on 1 and 5 DF,  p-value: 5.301e-08

Plot for data set 2:

pairs(matrix(c(x1, x2, x3, x4, y), ncol = 5, byrow = F),
      labels = c("x1", "x2", "x3", "x4", "y"))

[Figure: pairwise scatter plots of x1, x2, x3, x4 and y for data2 - again all pairs are strongly correlated]

Analysing the two data sets using the mean of the four x's as a single variable instead:

xmn1 <- (data1[,1] + data1[,2] + data1[,3] + data1[,4])/4
xmn2 <- (data2[,1] + data2[,2] + data2[,3] + data2[,4])/4
rm1 <- lm(data1[,5] ~ xmn1)
rm2 <- lm(data2[,5] ~ xmn2)
summary(rm1)

Call:
lm(formula = data1[, 5] ~ xmn1)

Residuals:
        1         2         3         4         5         6         7
 0.005688 -0.177580  0.240998 -0.004289 -0.110701  0.116733 -0.070850

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.02453    0.12888    0.19    0.857
xmn1         1.01966    0.02878   35.43 3.37e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1553 on 5 degrees of freedom
Multiple R-squared: 0.996,  Adjusted R-squared: 0.9952
F-statistic: 1255 on 1 and 5 DF,  p-value: 3.37e-07

summary(rm2)

Call:
lm(formula = data2[, 5] ~ xmn2)

Residuals:
        1         2         3         4         5         6         7
 0.141510 -0.084187 -0.119556  0.001988 -0.028454  0.058752  0.029946

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.25039    0.07983   3.137   0.0258 *
xmn2         0.92743    0.01762  52.644 4.68e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09651 on 5 degrees of freedom
Multiple R-squared: 0.9982,  Adjusted R-squared: 0.9978
F-statistic: 2771 on 1 and 5 DF,  p-value: 4.676e-08

By the way, check what the loading structure is for a PCA of the X-data:

# Almost all variance explained in the first component:
princomp(data1[,1:4])

Call:
princomp(x = data1[, 1:4])

Standard deviations:
    Comp.1     Comp.2     Comp.3     Comp.4
4.08135743 0.24012411 0.14468073 0.03297056

4 variables and 7 observations.

# The loadings of the first component:
princomp(data1[,1:4])$loadings[,1]

[1] -0.5145243 -0.4707128 -0.5032351 -0.5103416

Note how they are almost the same, such that the first component essentially is the mean of the four variables.

Let us save the beta-coefficients for some predictions - one complete set from data 1 and one from the mean analysis:

cf1 <- summary(lm(data1[,5] ~ data1[,1] + data1[,2] + data1[,3] +
                  data1[,4]))$coefficients[,1]
cf <- summary(rm2)$coefficients[,1]

We now simulate how the three approaches (full model, mean (= PCR) and single variable) perform in 7000 predictions:

# Simulation of prediction:
error <- 0.2
y  <- rep(1:7, 1000) + rnorm(7000, sd = error)
x1 <- rep(1:7, 1000) + rnorm(7000, sd = error)
x2 <- rep(1:7, 1000) + rnorm(7000, sd = error)
x3 <- rep(1:7, 1000) + rnorm(7000, sd = error)
x4 <- rep(1:7, 1000) + rnorm(7000, sd = error)
yhat <- cf1[1] + matrix(c(x1, x2, x3, x4), ncol = 4, byrow = F) %*%
        t(t(cf1[2:5]))
xmn <- (x1 + x2 + x3 + x4)/4
yhat2 <- cf[1] + cf[2] * xmn
yhat3 <- cf[1] + cf[2] * x3
barplot(c(sum((y-yhat)^2)/7000, sum((y-yhat2)^2)/7000,
          sum((y-yhat3)^2)/7000), col = heat.colors(3),
        names.arg = c("Full MLR", "Average", "x3"), cex.names = 1.5,
        main = "Average squared prediction error")

[Figure: bar plot of the average squared prediction error for the three approaches "Full MLR", "Average" and "x3" - the "Average" model has the smallest error]

The PCR-like analysis is the winner!
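The "use the average of the x's" model is essentially a one-component PCR: as the loadings above show, the first principal component weights the four x's almost equally. A minimal hand-rolled sketch of that equivalence on data1 (this block is an illustration added here, not part of the original analysis):

# 1-component PCR by hand: PCA on the X-part, then MLR on the first score
pca    <- prcomp(data1[, 1:4])     # centred PCA of the X-matrix
score1 <- pca$x[, 1]               # scores on the first component
pcr1   <- lm(data1[, 5] ~ score1)  # regress y on that single score
summary(pcr1)$r.squared            # fit is very close to the mean-x model

Since the first loading vector is nearly (1/2, 1/2, 1/2, 1/2) up to sign, the score is (up to centring and scaling) the average of the four x's, which is why the two analyses agree.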
4.2.2 Example: Spectral type data

Constructing some artificial spectral data (7 observations, 100 wavelengths):

# Spectral example
x <- (-39:60)/10
spectra <- matrix(rep(0, 700), ncol = 100)
for (i in 1:7) spectra[i,] <- i * dnorm(x) +
                              i * dnorm(x) * rnorm(100, sd = 0.02)
y <- 1:7 + rnorm(7, sd = 0.2)

matplot(t(spectra), type = "n", xlab = "Wavelength", ylab = "")
matlines(t(spectra))

[Figure: the 7 raw spectra plotted against wavelength]

Mean spectrum indicated:

matplot(t(spectra), type = "n", xlab = "Wavelength", ylab = "")
matlines(t(spectra))
meansp <- apply(spectra, 2, mean)
lines(1:100, meansp, lwd = 2)

[Figure: the raw spectra with the mean spectrum drawn as a thick line]

The mean centered spectra:

spectramc <- scale(spectra, scale = F)
matplot(t(spectramc), type = "n", xlab = "Wavelength", ylab = "")
matlines(t(spectramc))

[Figure: the mean centered spectra]

The standardized spectra:

spectramcs <- scale(spectra, scale = T, center = T)
matplot(t(spectramcs), type = "n", xlab = "Wavelength", ylab = "")
matlines(t(spectramcs))

[Figure: the standardized spectra]

# Doing the PCA on the correlation matrix with the eigen function:
pcares <- eigen(cor(spectra))
loadings1 <- pcares$vectors[,1]
scores1 <- spectramcs %*% t(t(loadings1))
pred <- scores1 %*% loadings1
stdsp <- apply(spectra, 2, sd)

## 1-PCA predictions transformed to original scales and means:
predorg <- pred * matrix(rep(stdsp, 7), byrow = T, nrow = 7) +
           matrix(rep(meansp, 7), nrow = 7, byrow = T)

All the plots collected in a single overview plot:

par(mfrow = c(3, 3), mar = 0.6 * c(5, 4, 4, 2))
matplot(t(spectra), type = "n", xlab = "Wavelength", ylab = "",
        main = "Raw spectra", las = 1)
matlines(t(spectra))
matplot(t(spectramc), type = "n", xlab = "Wavelength", ylab = "",
        main = "Mean corrected spectra", las = 1)
matlines(t(spectramc))
matplot(t(spectramcs), type = "n", xlab = "Wavelength", ylab = "",
        main = "Standardized spectra", las = 1)
matlines(t(spectramcs))
matplot(t(spectra), type = "n", xlab = "Wavelength", ylab = "",
        main = "Mean Spectrum", las = 1)
lines(1:100, meansp, lwd = 2)
plot(1:100, -loadings1, ylim = c(0, 0.2), xlab = "Wavelength",
     ylab = "", main = "PC1 Loadings", las = 1)
matplot(t(pred), type = "n", xlab = "Wavelength", ylab = "",
        main = "Reconstruction using PC1", las = 1)
matlines(t(pred))
matplot(t(spectra), type = "n", xlab = "Wavelength", ylab = "",
        main = "Standard deviations", las = 1)
lines(1:100, stdsp, lwd = 2)
plot(1:7, scores1[7:1], main = "PC1 Scores", xlab = "Samples",
     ylab = "", las = 1)
matplot(t(predorg), type = "n", xlab = "Wavelength", ylab = "",
        main = "Reconstruction using PC1")
matlines(t(predorg))
par(mfrow = c(1, 1))

[Figure: 3x3 overview - raw spectra, mean corrected spectra, standardized spectra, mean spectrum, PC1 loadings, reconstruction using PC1 (standardized scale), standard deviations, PC1 scores, reconstruction using PC1 (original scale)]
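As a numeric companion to the overview plot, the eigenvalues computed above show directly how dominant the first component is (a small sketch reusing the pcares object):

# Proportion of variance (on the correlation scale) per component -
# the first entry should be very close to 1:
round(pcares$values / sum(pcares$values), 3)[1:5]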
4.2.3 Some more presentation stuff

PCR: what is it?

• Data situation:

$$
y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \qquad
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
$$

• Do MLR with $A$ principal components $t_1, \ldots, t_A$ instead of all (or some) of the x's.
• How many components? Determine by cross-validation!

How to do it?

1. Explore data
2. Do modelling (choose number of components, consider variable selection)
3. Validate (residuals, outliers, influence etc.)
4. Iterate, e.g. on 2. and 3.
5. Interpret, conclude, report.
6. If relevant: predict future values.

Cross Validation ("Full")

• Leave out one of the observations
• Fit a model on the remaining (reduced) data
• Predict the left-out observation with that model: $\hat{y}_{i,\mathrm{val}}$
• Do this in turn for ALL observations AND calculate the overall performance of the model:

$$
\mathrm{RMSEP} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_{i,\mathrm{val}}\right)^2}
$$

(Root Mean Squared Error of Prediction)

Cross Validation ("Full")

Finally: do the cross-validation for ALL choices of number of components (0, 1, 2, ...) AND plot the performances, e.g. (constructed plot):

barplot(c(10, 5, 3, 3.1, 3.2, 4, 6, 9), names.arg = 0:7,
        xlab = "No components", ylab = "RMSEP", cex.names = 2,
        main = "Validation results")

[Figure: constructed bar plot of RMSEP versus number of components (0-7), with a minimum around 2-3 components]

Cross Validation ("Full")

Choose the optimal number of components:

• The one with overall minimal error
• The first local minimum
• In Hastie et al.: the smallest number within the uncertainties of the overall minimum one.

Resampling

• Cross-Validation (CV)
• Jackknifing (leave-one-out CV)
• Bootstrapping
• A good generic approach:
  - Split the data into a TRAINING and a TEST set.
  - Use cross-validation on the TRAINING data.
  - Check the model performance on the TEST set.
  - MAYBE: REPEAT all this many times (repeated double cross-validation).

Cross Validation - principle

• Minimizes the expected prediction error:

$$
\text{Squared prediction error} = \text{Bias}^2 + \text{Variance}
$$

• Including "many" PC components: LOW bias, but HIGH variance
• Including "few" PC components: HIGH bias, but LOW variance
• Choose the best compromise!
• Note: including ALL components = MLR (when n > p)

Validation - exists on different levels

1. Split in 3: Training (50%), Validation (25%) and Test (25%)
   • Requires many observations - rarely used
2. Split in 2: Calibration/training (67%) and Test (33%) - use CV/bootstrap within the training set
   • More commonly used
3. No "fixed split", but repeated splits by CV/bootstrap, and then CV within each training set ("repeated double CV")
4. No split, but using (one level of) CV/bootstrap.
5. Just fitting on all data - and checking the error.

4.3 Example: Car Data (again)

# Example: using the car data:
data(mtcars)
mtcars$logmpg <- log(mtcars$mpg)

# Define the X-matrix as a matrix in the data frame:
mtcars$X <- as.matrix(mtcars[, 2:11])

# First of all we set aside a random selection of 4 cars as a TEST set:
mtcars$train <- TRUE
mtcars$train[sample(1:length(mtcars$train), 4)] <- FALSE
mtcars_TEST <- mtcars[mtcars$train == FALSE,]
mtcars_TRAIN <- mtcars[mtcars$train == TRUE,]

Now all the work is performed on the TRAIN data set.
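Before handing the cross-validation over to the pls package below, it may help to see leave-one-out CV written out by hand. The following sketch computes the RMSEP from Section 4.2.3 for a small MLR on the training set (the predictors wt and hp are an arbitrary choice for illustration):

# Hand-rolled leave-one-out CV for lm(logmpg ~ wt + hp):
n <- nrow(mtcars_TRAIN)
sq_err <- numeric(n)
for (i in 1:n) {
  fit <- lm(logmpg ~ wt + hp, data = mtcars_TRAIN[-i, ])      # fit without obs. i
  sq_err[i] <- (mtcars_TRAIN$logmpg[i] -
                predict(fit, newdata = mtcars_TRAIN[i, ]))^2  # predict obs. i
}
sqrt(mean(sq_err))  # the RMSEP

This is exactly what validation = "LOO" does for every choice of the number of components in the pcr() call below.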
Explore the data

We already did this previously, so no more of that here.

Next: model the data.

# Run the PCR with a maximal/large number of components using the pls package:
library(pls)
mod <- pcr(logmpg ~ X, ncomp = 10, data = mtcars_TRAIN,
           validation = "LOO", scale = TRUE, jackknife = TRUE)

Initial set of plots:

# Initial set of plots:
par(mfrow = c(2, 2))
plot(mod, labels = rownames(mtcars_TRAIN), which = "validation")
plot(mod, "validation", estimate = c("train", "CV"),
     legendpos = "topright")
plot(mod, "validation", estimate = c("train", "CV"), val.type = "R2",
     legendpos = "bottomright")
scoreplot(mod, labels = rownames(mtcars_TRAIN))

[Figure: 2x2 panel - cross-validated predicted vs. measured logmpg, RMSEP vs. number of components (train and CV), R2 vs. number of components, and the score plot of components 1 (57%) and 2 (27%), labelled by car name]

Choice of components:

# Choice of components:
# What would segmented CV give:
mod_segCV <- pcr(logmpg ~ X, ncomp = 10, data = mtcars_TRAIN,
                 scale = TRUE, validation = "CV", segments = 5,
                 segment.type = c("random"), jackknife = TRUE)

# Validation plots:
par(mfrow = c(1, 2))
plot(mod_segCV, "validation", estimate = c("train", "CV"),
     legendpos = "topright")
plot(mod_segCV, "validation", estimate = c("train", "CV"),
     val.type = "R2", legendpos = "bottomright")

[Figure: RMSEP and R2 versus number of components for the segmented (5 random segments) cross-validation]
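As a cross-check of the visual choice, newer versions of the pls package provide selectNcomp(), which automates the "global minimum" and "one-sigma" rules discussed in Section 4.2.3 (a sketch; the function and its arguments assume pls version 2.5-0 or later):

# Smallest number of components whose CV error is within one standard
# error of the global minimum:
selectNcomp(mod, method = "onesigma", plot = TRUE)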
Let us look at some more components:

# Scores:
scoreplot(mod, comps = 1:4, labels = rownames(mtcars_TRAIN))

[Figure: scatter plot matrix of the scores for components 1-4 (56.6%, 26.6%, 6.6% and 2.7% explained variance), labelled by car name]
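The explained-variance percentages printed on the score plot axes can also be read off numerically (a one-line sketch using pls::explvar):

# Percent X-variance explained per component:
round(explvar(mod), 1)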
# Loadings:
loadingplot(mod, comps = 1:4, scatter = TRUE, labels = prednames(mod))

[Figure: scatter plot matrix of the loadings for components 1-4, labelled by variable name]

We choose 3 components:

# We choose 3 components:
mod3 <- pcr(logmpg ~ X, ncomp = 3, data = mtcars_TRAIN,
            validation = "LOO", scale = TRUE, jackknife = TRUE)

Then: validate.

Let us validate some more, using 3 components. We take the predicted values, and hence the residuals, from the predplot function, so these are the (CV) VALIDATED versions!

par(mfrow = c(2, 2))
k <- 3
obsfit <- predplot(mod3, labels = rownames(mtcars_TRAIN),
                   which = "validation")
Residuals <- obsfit[,1] - obsfit[,2]
plot(obsfit[,2], Residuals, type = "n", main = k, xlab = "Fitted",
     ylab = "Residuals")
text(obsfit[,2], Residuals, labels = rownames(mtcars_TRAIN))
qqnorm(Residuals)

# To plot residuals against X-leverage, we need the leverage values,
# i.e. the diagonal of the Hat-matrix based on the fitted X-values (scores):
Xf <- scores(mod3)
H <- Xf %*% solve(t(Xf) %*% Xf) %*% t(Xf)
leverage <- diag(H)
plot(leverage, abs(Residuals), type = "n", main = k)
text(leverage, abs(Residuals), labels = rownames(mtcars_TRAIN))

[Figure: 2x2 panel - validated predicted vs. measured logmpg, residuals vs. fitted values, normal Q-Q plot of the residuals, and absolute residuals vs. leverage, all labelled by car name]
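The leverage panel can be turned into a rough rule of thumb: with A orthogonal score columns the average leverage is A/n, and points far above that deserve a closer look. A small sketch reusing the leverage values just computed (the factor 2 is a common but arbitrary choice):

# Cars with leverage more than twice the average leverage:
rownames(mtcars_TRAIN)[leverage > 2 * mean(leverage)]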
# Let us also plot the residuals versus each input X:
par(mfrow = c(3, 4))
for (i in 2:11) {
  plot(Residuals ~ mtcars_TRAIN[,i], type = "n",
       xlab = names(mtcars_TRAIN)[i])
  text(mtcars_TRAIN[,i], Residuals, labels = row.names(mtcars_TRAIN))
  lines(lowess(mtcars_TRAIN[,i], Residuals), col = "blue")
}

[Figure: residuals plotted against each of the 10 input variables (cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb), each panel with a lowess smoother]

Interpret/conclude
# Now let's look at the results - "interpret/conclude":
par(mfrow = c(2, 2))

# Plot predictions, R2 and coefficients with uncertainty from the jackknife:
obsfit <- predplot(mod3, labels = rownames(mtcars_TRAIN),
                   which = "validation")
abline(lm(obsfit[,2] ~ obsfit[,1]))
plot(mod, "validation", estimate = c("train", "CV"), val.type = "R2",
     legendpos = "bottomright")
coefplot(mod3, se.whiskers = TRUE, labels = prednames(mod3),
         cex.axis = 0.5)
biplot(mod3)

[Figure: 2x2 panel - validated predicted vs. measured with regression line, R2 vs. number of components, jackknife coefficient plot with whiskers, and the biplot of X scores and X loadings]

# And then finally some output numbers:
jack.test(mod3, ncomp = 3)

Response logmpg (3 comps):
       Estimate Std. Error Df t value  Pr(>|t|)
cyl  -0.0366977  0.0077887 27 -4.7116 6.611e-05 ***
disp -0.0452754  0.0108002 27 -4.1921 0.0002658 ***
hp   -0.0557347  0.0118127 27 -4.7182 6.495e-05 ***
drat  0.0213254  0.0149417 27  1.4272 0.1649761
wt   -0.0707133  0.0134946 27 -5.2401 1.598e-05 ***
qsec -0.0073511  0.0137758 27 -0.5336 0.5979674
vs    0.0028425  0.0168228 27  0.1690 0.8670842
am    0.0436837  0.0128767 27  3.3925 0.0021513 **
gear  0.0104731  0.0109513 27  0.9563 0.3473857
carb -0.0635746  0.0198725 27 -3.1991 0.0035072 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Prediction

# And now let's try to predict the 4 data points from the TEST set:
preds <- predict(mod3, newdata = mtcars_TEST, comps = 3)
plot(mtcars_TEST$logmpg, preds)

[Figure: predicted vs. observed logmpg for the 4 TEST-set cars]

rmsep <- sqrt(mean((mtcars_TEST$logmpg - preds)^2))
rmsep

[1] 0.3452285

4.4 Exercises

Exercise 1: Prostate Cancer data

Use the Prostate data also used for the MLR exercises. Make sure to support all the analysis with plenty of plotting - try to "play around" with plotting results with/without extreme observations. Also try to look at the results for different choices of the number of components (to get a feeling for the consequence of that). The R code (including comments) for the car data above can be used as a template for a good approach; a hypothetical starting skeleton is also sketched after this exercise.

a) Define test and training sets according to the last column in the data set. Do PCR on the training set - use cross-validation. Go through ALL the relevant steps:
   1. model selection
   2. validation
   3. interpretation
   4. etc.

b) Predict the test set lpsa values. What is the average prediction error for the test set?

c) Compare with the cross-validation error in the training set - try both "LOO" and a segmented version of CV.

d) Compare with an MLR prediction using ALL predictors.
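A hypothetical starting skeleton for a) and b) follows; the names prostate_TRAIN, prostate_TEST and the matrix column X are assumptions that must be adapted to how you load and split the Prostate data (the 3 components in the prediction step are only an example):

library(pls)

# a) PCR on the training set with leave-one-out cross-validation
#    (prostate_TRAIN is assumed to hold the training rows, with the
#    predictors collected in a matrix column X, as for the car data):
pmod <- pcr(lpsa ~ X, ncomp = 8, data = prostate_TRAIN,
            validation = "LOO", scale = TRUE, jackknife = TRUE)
validationplot(pmod, val.type = "RMSEP")  # choose the number of components

# b) Predict the test-set lpsa values and compute the average
#    prediction error (RMSEP) on the test set:
preds <- predict(pmod, newdata = prostate_TEST, ncomp = 3)
sqrt(mean((prostate_TEST$lpsa - preds)^2))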