
732A20 Data Mining and Statistical Learning
Department of Computer and Information Science

Computer lab 6: Model selection and cross validation

Learning objectives

The main objective of this computer lab is to increase the understanding of the trade-off between bias and variance in statistical prediction and to make the student familiar with leave-one-out and block cross-validation. After completing the lab the student shall be able to:

(i) Explain how the bias and variance of a predictor can change when a model is made more complex.
(ii) Explain the advantages and disadvantages of leave-one-out and block cross-validation.

Recommended reading

Chapters 7.1-7.5 and 7.10 in Hastie et al.

Assignment 1: Trade-off between bias and variance

Open the Excel document ‘variancedecomp.xls’ and select the worksheet ‘Summary output’. This document contains a Visual Basic macro that simulates the behaviour of two simple linear predictors. (If you cannot open the macros in this file, you can use Tools > Options > Security > Macro security to adjust the security level to medium.)

Each simulation involves the following steps:

i. Ten pairs (x, y) of data are generated according to a quadratic regression model ($y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$) with parameters given in the first column of the worksheet.
ii. Two predictors are derived by fitting (a) a simple linear regression model and (b) a quadratic regression model to the generated data set.
iii. Predictions are computed for a new observation for which x = 2.

When you click the button ‘Start simulation’, the macro will repeat the above-mentioned steps 1000 times and write the predicted values in columns F and G.

Your task is to examine how the bias, variance and mean square error of the two predictors vary with the amount of noise in the generated data, i.e. with $\sigma^2 = \mathrm{Var}(\varepsilon)$. Select suitable levels of $\sigma$ and illustrate the trade-off between bias and variance in a suitable diagram.
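
For readers without access to the Excel macro, the simulation in Assignment 1 can be approximated with the following minimal Python sketch. It is not the macro itself: the coefficient values (all set to 1), the uniform design for x on [0, 3], the normal noise, and the function names `simulate_predictions` and `decompose` are assumptions made for illustration, since the worksheet parameters are not reproduced here.

```python
import numpy as np


def simulate_predictions(sigma, beta=(1.0, 1.0, 1.0), n=10, n_sim=1000,
                         x_new=2.0, seed=0):
    """Simulate predictions at x_new from a linear and a quadratic fit.

    Assumptions (not taken from the worksheet): x is uniform on [0, 3],
    the noise is normal with standard deviation sigma, beta = (1, 1, 1).
    """
    rng = np.random.default_rng(seed)
    b0, b1, b2 = beta
    pred_lin = np.empty(n_sim)
    pred_quad = np.empty(n_sim)
    for s in range(n_sim):
        # Step i: generate n pairs (x, y) from the quadratic model.
        x = rng.uniform(0.0, 3.0, size=n)
        y = b0 + b1 * x + b2 * x**2 + rng.normal(0.0, sigma, size=n)
        # Steps ii and iii: fit both models and predict at x_new.
        pred_lin[s] = np.polyval(np.polyfit(x, y, 1), x_new)
        pred_quad[s] = np.polyval(np.polyfit(x, y, 2), x_new)
    return pred_lin, pred_quad


def decompose(pred, truth):
    """Return (bias, variance, mse) of predictions of a fixed true mean."""
    bias = pred.mean() - truth
    variance = pred.var()  # ddof=0, so mse == bias**2 + variance
    mse = np.mean((pred - truth) ** 2)
    return bias, variance, mse


if __name__ == "__main__":
    truth = 1.0 + 1.0 * 2.0 + 1.0 * 2.0**2  # true mean response at x = 2
    for sigma in (0.1, 0.5, 1.0, 2.0, 4.0):
        lin, quad = simulate_predictions(sigma)
        print(sigma, decompose(lin, truth), decompose(quad, truth))
```

Under these assumptions the errors are measured against the true mean response at x = 2, so the irreducible noise of a new observation is not included in the reported mean square error. Plotting squared bias and variance against $\sigma$ for both predictors gives one possible form of the requested diagram.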

Assignment 2: Selection of penalty factor using cross-validation

Consider the dataset ‘Mortality_rate’ analysed in lab 5.

1. Run proc TPSPLINE without specifying the degrees of freedom or the penalty factor. An optimal penalty factor is then determined by cross-validation. Make a plot of observed and fitted mortality rates vs Day using proc GPLOT. Compare the penalty factor obtained with cross-validation with the factor you considered to be suitable in lab 5.
2. Run proc TPSPLINE with the statement “ods output GCVFunction=your_data_set” and a suitable collection of penalty factors representing a neighborhood of the optimal penalty factor (use lognlambda). Make a plot of the cross-validation factor (CV) versus the penalty factor. Does the penalty factor obtained in step 1 correspond to a minimum value of CV?

Assignment 3: Selection of PLS models using leave-one-out and block cross-validation

The Excel file “tecator.xls” analysed in computer lab 3 contains the results of a study investigating whether a near-infrared absorbance spectrum can be used to predict the fat content of samples of meat. Your task is to use two cross-validation options in proc PLS in SAS to select a suitable prediction model for the fat content. Inspect the help file for proc PLS to find out how you can perform leave-one-out cross-validation and block cross-validation. Use all data on the worksheet ‘data’ for your analysis. Compare and comment on the results obtained for leave-one-out cross-validation and block cross-validation using block sizes 5, 10 and 20.

Assignment 4: Simulation study of leave-one-out and block cross-validation

Open the Excel document ‘crossvalidation.xls’ and select the worksheet ‘Summary output’. This document contains a Visual Basic macro that simulates the behaviour of leave-one-out and block cross-validation.

(If you cannot open the macros in this file, you can use Tools > Options > Security > Macro security to adjust the security level to medium.)

Each simulation involves the following steps:

i) One hundred pairs (x, y) of data are generated according to a quadratic regression model ($y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$) with parameters given in the first column of the worksheet.
ii) Predictors are derived by leaving out blocks of data and fitting quadratic regression models to the remaining data in the generated data set.
iii) Predictions are computed for the blocks of observations that have been left out.
iv) The average sum of prediction errors (ASE) is computed.

When you click the button ‘Start simulation’, the macro will repeat the above-mentioned steps 400 times and write the ASE values in column C.

Your task is to examine how the mean and standard deviation of the simulated ASE levels vary with the block size. State advantages and disadvantages of using a small block size.

To hand in

Highlighted items.
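
The simulation steps of Assignment 4 can likewise be sketched in Python for readers who cannot run the macro. This is a rough sketch under stated assumptions, not the macro's actual code: ASE is interpreted here as the average squared prediction error over all left-out observations, blocks are taken as consecutive observations, x is drawn uniformly on [0, 3], all coefficients are set to 1, and the names `block_cv_ase` and `simulate_ase` are hypothetical.

```python
import numpy as np


def block_cv_ase(x, y, block_size):
    """Block cross-validation for a quadratic fit; block_size=1 is leave-one-out.

    Returns the average squared prediction error over all left-out points
    (the interpretation of ASE assumed in this sketch).
    """
    n = len(x)
    all_idx = np.arange(n)
    sq_err = np.empty(n)
    for start in range(0, n, block_size):
        test = all_idx[start:start + block_size]   # block left out
        train = np.setdiff1d(all_idx, test)        # remaining data
        coef = np.polyfit(x[train], y[train], 2)   # quadratic regression
        sq_err[test] = (y[test] - np.polyval(coef, x[test])) ** 2
    return sq_err.mean()


def simulate_ase(block_size, sigma=1.0, beta=(1.0, 1.0, 1.0), n=100,
                 n_sim=400, seed=0):
    """Return mean and standard deviation of ASE over n_sim simulated data sets.

    Assumptions (not taken from the worksheet): uniform x on [0, 3],
    normal noise with standard deviation sigma, beta = (1, 1, 1).
    """
    rng = np.random.default_rng(seed)
    b0, b1, b2 = beta
    ase = np.empty(n_sim)
    for s in range(n_sim):
        x = rng.uniform(0.0, 3.0, size=n)
        y = b0 + b1 * x + b2 * x**2 + rng.normal(0.0, sigma, size=n)
        ase[s] = block_cv_ase(x, y, block_size)
    return ase.mean(), ase.std(ddof=1)


if __name__ == "__main__":
    for block_size in (1, 5, 10, 20, 50):
        print(block_size, simulate_ase(block_size))
```

Because x is generated at random, consecutive blocks behave like randomly chosen blocks in this sketch; with ordered data the choice of blocks would matter more.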
