732A20 Data Mining and Statistical Learning
Department of Computer and Information Science
Computer lab 6: Model selection and cross validation
Learning objectives
The main objective of this computer lab is to increase the understanding of the trade-off
between bias and variance in statistical prediction and to make the student familiar with
leave-one-out and block cross-validation.
After completing the lab the student shall be able to:
(i) Explain how the bias and variance of a predictor can change when a model is
made more complex.
(ii) Explain the advantages and disadvantages of leave-one-out and block
cross-validation.
Recommended reading
Chapter 7.1-7.5 and 7.10 in Hastie et al.
Assignment 1: Trade-off between bias and variance
Open the Excel document ‘variancedecomp.xls’ and select the worksheet ‘Summary
output’. This document contains a Visual Basic macro that performs simulations of the
behaviour of two simple linear predictors. (If you cannot open the macros in this file, you
can use Tools → Options → Security → Macro security to adjust the security level to
medium.) Each simulation involves the following steps:
i. Ten pairs (x, y) of data are generated according to a quadratic regression model
(y = β₀ + β₁x + β₂x² + ε) with parameters given in the first column of the worksheet.
ii. Two predictors are derived by fitting (a) a simple linear regression model and (b)
a quadratic regression model to the generated data set.
iii. Predictions are computed for a new observation for which x = 2.
When you click the button ‘Start simulation’ the macro will repeat the
above-mentioned steps 1000 times and write the predicted values to columns F and G.
Your task is to examine how the bias, variance and mean square error of the two
predictors vary with the amount of noise in the generated data, i.e. with σ² = Var(ε).
Select suitable levels of σ and illustrate the trade-off between bias and variance in a
suitable diagram.
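The simulation steps above can also be sketched outside Excel. The following Python sketch is not the macro itself: the coefficient values, the uniform design for x on [0, 4], the noise levels and all function names are assumptions made for illustration, since the actual settings are stored in the worksheet. It estimates bias, variance and mean square error of both predictors at x = 2.

```python
import numpy as np

def simulate_predictions(sigma, n_sims=1000, n_obs=10, x_new=2.0,
                         beta=(1.0, 1.0, 1.0), seed=0):
    """Return (true mean at x_new, linear predictions, quadratic predictions)."""
    rng = np.random.default_rng(seed)
    b0, b1, b2 = beta  # assumed coefficients; the worksheet values may differ
    true_mean = b0 + b1 * x_new + b2 * x_new**2
    preds = np.empty((n_sims, 2))
    for s in range(n_sims):
        # Assumed design: x uniform on [0, 4], normally distributed noise.
        x = rng.uniform(0.0, 4.0, n_obs)
        y = b0 + b1 * x + b2 * x**2 + rng.normal(0.0, sigma, n_obs)
        for j, degree in enumerate((1, 2)):
            coefs = np.polyfit(x, y, degree)       # least-squares polynomial fit
            preds[s, j] = np.polyval(coefs, x_new)
    return true_mean, preds[:, 0], preds[:, 1]

def bias_variance_mse(preds, true_mean):
    """Empirical bias, variance and mean square error of a set of predictions."""
    bias = preds.mean() - true_mean
    variance = preds.var()
    mse = np.mean((preds - true_mean) ** 2)
    return bias, variance, mse

for sigma in (0.5, 2.0, 8.0):
    truth, lin, quad = simulate_predictions(sigma)
    for name, p in (("linear", lin), ("quadratic", quad)):
        b, v, m = bias_variance_mse(p, truth)
        print(f"sigma={sigma:4.1f} {name:9s} bias={b:7.3f} var={v:8.3f} mse={m:8.3f}")
```

With these definitions the mean square error equals the squared bias plus the variance, so the three columns can be plotted against σ to show where each predictor gains or loses.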
Assignment 2: Selection of penalty factor using cross-validation
Consider the dataset ‘Mortality_rate’ analysed in lab 5.
1. Run proc TPSPLINE without specifying the degrees of freedom or the penalty
factor. Then an optimal penalty factor is determined by cross-validation. Make a
plot of observed and fitted mortality rates vs Day using proc GPLOT. Compare
the penalty factor obtained with cross-validation with the factor you considered to
be suitable in lab 5.
2. Run proc TPSPLINE with the statement “ods output
GCVFunction=your_data_set” and a suitable collection of penalty factors
representing a neighborhood of the optimal penalty factor (use the lognlambda option).
Make a plot of the cross-validation criterion (CV) versus the penalty factor. Does the
penalty factor obtained in step 1 correspond to a minimum value of CV?
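The idea of scanning a grid of penalty factors and locating the minimum of the cross-validation criterion can be illustrated with a small Python sketch. It does not reproduce proc TPSPLINE: a simple second-difference penalty smoother is swapped in for the thin-plate spline, the data are synthetic stand-ins for the mortality series, and the grid is in log10 of the penalty itself rather than in the parametrisation used by lognlambda. All names are hypothetical.

```python
import numpy as np

def second_difference_smoother(n, lam):
    """Hat matrix S = (I + lam * D'D)^-1 for n equally spaced observations."""
    D = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second-difference operator
    return np.linalg.inv(np.eye(n) + lam * D.T @ D)

def gcv_score(y, S):
    """Generalized cross-validation: mean squared residual / (1 - tr(S)/n)^2."""
    n = len(y)
    resid = y - S @ y
    return (resid @ resid / n) / (1.0 - np.trace(S) / n) ** 2

# Synthetic stand-in data (assumption): a smooth curve plus noise.
rng = np.random.default_rng(1)
day = np.arange(100)
y = np.sin(day / 15.0) + rng.normal(0.0, 0.3, day.size)

# Evaluate the criterion on a grid of penalties and locate the minimum.
log10_lambdas = np.linspace(-2.0, 6.0, 33)
scores = np.array([gcv_score(y, second_difference_smoother(day.size, 10.0**l))
                   for l in log10_lambdas])
best = log10_lambdas[scores.argmin()]
print(f"GCV is minimised at log10(lambda) = {best:.2f}")
```

Plotting scores against log10_lambdas gives the kind of curve asked for in step 2: the criterion rises for very small penalties (overfitting) and for very large ones (oversmoothing).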
Assignment 3: Selection of PLS models using leave-one-out and block cross-validation
The Excel file “tecator.xls” analysed in computer lab 3 contains the results of a study
aimed at investigating whether a near infrared absorbance spectrum can be used to predict
the fat content of samples of meat. Your task is to use two cross-validation options in
proc PLS in SAS to select a suitable prediction model for the fat content. Inspect the help
file for proc PLS to find out how you can perform leave-one-out cross-validation and
block cross-validation.
Use all data on the worksheet ‘data’ for your analysis. Compare and comment on the results
obtained for leave-one-out cross-validation and block cross-validation using block sizes
5, 10 and 20.
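To see what the block size means in practice, the following Python sketch lists how many models have to be fitted for each setting. It assumes that a block consists of consecutive observations and that leave-one-out is the special case of block size 1; how proc PLS actually forms its blocks should be checked in the help file. The function name and the sample count are hypothetical.

```python
def block_folds(n_obs, block_size):
    """Split indices 0..n_obs-1 into consecutive held-out blocks."""
    return [list(range(start, min(start + block_size, n_obs)))
            for start in range(0, n_obs, block_size)]

n_obs = 23  # hypothetical sample count, chosen only for illustration
for block_size in (1, 5, 10, 20):
    folds = block_folds(n_obs, block_size)
    print(f"block size {block_size:2d}: {len(folds):2d} model fits, "
          f"last block holds {len(folds[-1])} observation(s)")
```

Each block is held out once while the model is fitted to the remaining observations, so a smaller block size means more fits, each on a training set closer to the full data set.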
Assignment 4: Simulation study of leave-one-out and block cross-validation
Open the Excel document ‘crossvalidation.xls’ and select the worksheet ‘Summary
output’. This document contains a Visual Basic macro that performs simulations of the
behaviour of leave-one-out and block cross-validation. (If you cannot open the macros in
this file, you can use Tools → Options → Security → Macro security to adjust the
security level to medium.) Each simulation involves the following steps:
i) One hundred pairs (x, y) of data are generated according to a quadratic
regression model (y = β₀ + β₁x + β₂x² + ε) with parameters given in the first
column of the worksheet.
ii) Predictors are derived by leaving out blocks of data and fitting quadratic
regression models to the remaining data.
iii) Predictions are computed for the blocks of observations that have been left out.
iv) The average sum of prediction errors (ASE) is computed.
When you click the button ‘Start simulation’ the macro will repeat the above-mentioned
steps 400 times and write the ASE values to column C.
Your task is to examine how the mean and standard deviation of the simulated ASE
levels vary with the block size. State advantages and disadvantages of using a small
block size.
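The simulation can be mimicked with the Python sketch below. It is an illustration under stated assumptions, not the macro: ASE is taken to mean the average squared prediction error, blocks are consecutive observations, the coefficients, noise level and uniform design on [0, 4] are made up, and the function names are hypothetical. The demonstration loop uses 100 repetitions instead of 400 only to keep the run short.

```python
import numpy as np

def cv_ase(x, y, block_size):
    """Average squared prediction error from block cross-validation."""
    n = len(x)
    sq_errors = np.empty(n)
    for start in range(0, n, block_size):
        held_out = np.zeros(n, dtype=bool)
        held_out[start:start + block_size] = True
        # Fit a quadratic model to the data outside the held-out block.
        coefs = np.polyfit(x[~held_out], y[~held_out], 2)
        sq_errors[held_out] = (y[held_out] - np.polyval(coefs, x[held_out])) ** 2
    return sq_errors.mean()

def simulate_ase(block_size, n_sims=400, n_obs=100, sigma=1.0,
                 beta=(1.0, 1.0, 1.0), seed=0):
    """Repeat data generation and cross-validation; return one ASE per run."""
    rng = np.random.default_rng(seed)
    b0, b1, b2 = beta  # assumed coefficients; the worksheet values may differ
    ases = np.empty(n_sims)
    for s in range(n_sims):
        x = rng.uniform(0.0, 4.0, n_obs)   # assumed design for x
        y = b0 + b1 * x + b2 * x**2 + rng.normal(0.0, sigma, n_obs)
        ases[s] = cv_ase(x, y, block_size)
    return ases

for block_size in (1, 5, 20, 50):
    ases = simulate_ase(block_size, n_sims=100)  # fewer runs than the macro, for speed
    print(f"block size {block_size:2d}: mean ASE={ases.mean():.3f} "
          f"sd={ases.std(ddof=1):.3f}")
```

Comparing the printed means and standard deviations across block sizes, together with the number of model fits each setting requires, gives the material needed to discuss small versus large blocks.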
To hand in
Highlighted items.