732A20 Data Mining and Statistical Learning
Department of Computer and Information Science
Computer lab 3: Principal component analysis and
partial least squares regression
Learning objectives
The main objective of this computer lab is to make the student familiar with the concept
of principal components and the major pros and cons of principal components regression
(PCR) and partial least squares regression (PLS).
After completing the lab the student shall be able to:
(i) Explain how principal components can be derived from a data matrix by
computing eigenvectors of a covariance or correlation matrix,
(ii) Interpret a score plot,
(iii) Undertake a PCR and PLS regression in SAS Enterprise Miner and interpret
the estimated parameters and performance measures,
(iv) Use a cross-validation technique for model selection.
Recommended reading
Chapters 3.1 – 3.4.3 in Hastie et al.
Assignment 1: Dimension reduction and principal components
The Excel document ‘fluorescein.xls’ contains data regarding light reflection from a total
of 30 surfaces treated with zinc, rhodamine, manganese etc. For each surface,
measurements are undertaken for light representing 146 different wavelengths (channels),
i.e., the data have 146 dimensions. However, the measured values for adjacent channels
are strongly correlated, implying that the effective dimension of the measured values is
much smaller than 146.
Your task is to use SAS Enterprise Miner to create a diagram for principal components
analysis of the given data and to interpret the results of that analysis. For the sake of
simplicity the channels for which there was no variation in the light reflection have been
omitted.
Create a new diagram
Define a new diagram and draw a data flow with the nodes Input Data Source and
Princomp/Dmneural.
Assign a data set to the Input Data Source
Inspect the light reflection data to be analysed by making a line plot (in Excel) of the data
matrix in ‘fluorescein.xls’. Are the data for adjacent channels (wavelengths)
strongly correlated?
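As a complement to the visual inspection, the correlation between neighbouring channels can also be computed directly. The following is a minimal Python/NumPy sketch, not part of the SAS workflow of the lab. The data are synthetic (two smooth peaks with random weights plus noise) and only stand in for ‘fluorescein.xls’; all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_surfaces, n_channels = 30, 146
grid = np.linspace(0.0, 1.0, n_channels)

# Synthetic stand-in for the real spectra (an assumption, not the lab data):
# every surface is a random mixture of two smooth peaks plus small noise.
peak_a = np.exp(-((grid - 0.3) / 0.10) ** 2)
peak_b = np.exp(-((grid - 0.7) / 0.15) ** 2)
weights = rng.uniform(0.0, 1.0, size=(n_surfaces, 2))
X = weights @ np.vstack([peak_a, peak_b])
X += 0.005 * rng.normal(size=X.shape)

# Channel-by-channel correlation matrix (columns are variables).
R = np.corrcoef(X, rowvar=False)

# First off-diagonal: correlation of each channel with its right-hand neighbour.
adjacent = np.diag(R, k=1)
print(f"median adjacent-channel correlation: {np.median(adjacent):.3f}")
```

With the real data, X would be the 30 x 146 matrix read from the Excel file.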
Import the Excel file ‘fluorescein.xls’ to SAS and assign this SAS data set to the Input
Data Source node. Define appropriate model roles.
Extract principal components
Open the Princomp/Dmneural node and select principal components analysis based on
the covariance matrix of the given data. Run the node and examine the effective
dimension of the given data, i.e., how many principal components are needed to
explain most of the variation in the data.
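The computation behind this step can be sketched outside SAS as an eigen-decomposition of the sample covariance matrix. The Python/NumPy example below is an illustration under stated assumptions: the data are synthetic (three latent factors plus noise), the helper name pca_covariance is hypothetical, and the 99 % threshold is an arbitrary choice for "most of the variation".

```python
import numpy as np

def pca_covariance(X):
    """Eigenvalues (descending) and eigenvectors of the sample covariance of X."""
    S = np.cov(X, rowvar=False)               # columns are variables
    eigvals, eigvecs = np.linalg.eigh(S)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals = np.clip(eigvals[order], 0.0, None)  # remove tiny negative round-off
    return eigvals, eigvecs[:, order]

rng = np.random.default_rng(1)
# Synthetic data (assumption): 30 observations, 146 channels, 3 latent factors.
latent = rng.normal(size=(30, 3))
loadings = rng.normal(size=(3, 146))
X = latent @ loadings + 0.05 * rng.normal(size=(30, 146))

eigvals, eigvecs = pca_covariance(X)

# Cumulative proportion of the total variance explained by the first k components.
cumulative = np.cumsum(eigvals) / eigvals.sum()

# Smallest number of components whose cumulative proportion reaches 99 %.
n_needed = int(np.searchsorted(cumulative, 0.99) + 1)
print("components needed for 99 % of the variance:", n_needed)
```

The columns of eigvecs are the principal component directions, and n_needed plays the role of the effective dimension discussed above.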
Draw a score plot
Add a Distribution Explorer node to the current workflow diagram. Make a plot of the
observed data in the coordinate system defined by the first two principal components.
(Use the Set Axis menu to assign the X and Y items to PRIN1 and PRIN2, respectively.)
Can you identify distinct groups of objects in this new coordinate system?
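The coordinates shown in a score plot are the centred observations projected onto the leading eigenvectors. The Python/NumPy sketch below computes such scores for synthetic data with two artificial groups of surfaces. The data, the group structure and the helper name pca_scores are assumptions made for illustration only.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project centred observations onto the leading covariance eigenvectors."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return Xc @ top   # column 0 corresponds to PRIN1, column 1 to PRIN2

rng = np.random.default_rng(2)
# Synthetic data (assumption): two groups of 15 surfaces whose spectra
# share one shape but differ in overall level.
profile = np.sin(np.linspace(0.0, np.pi, 146))
group = np.repeat([0, 1], 15)
X = np.outer(1.0 + group, profile) + 0.05 * rng.normal(size=(30, 146))

scores = pca_scores(X)
prin1 = scores[:, 0]
print("PRIN1 range, group 0:", prin1[group == 0].min(), prin1[group == 0].max())
print("PRIN1 range, group 1:", prin1[group == 1].min(), prin1[group == 1].max())
```

A scatter plot of scores[:, 0] against scores[:, 1] is the score plot. The sign of each component is arbitrary, so the groups may appear mirrored.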
Assign a new data set to the Input Data Source
Assign the SAS data set derived from the file ‘lakesurvey.xls’ to the Input Data Source
node.
Extract and interpret the principal components
Extract principal components using the covariance and correlation matrices, respectively.
Why are the principal components so different in the two cases? Is it possible to assign a
physico-chemical meaning to the first two principal components derived from the
correlation matrix?
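The effect of the choice of matrix can be explored on a small synthetic example before turning to the lake survey data. In the Python/NumPy sketch below, one variable is measured on a much larger scale than two correlated variables. The variables and the helper name first_component are hypothetical and do not come from ‘lakesurvey.xls’.

```python
import numpy as np

def first_component(M):
    """Eigenvector of the symmetric matrix M with the largest eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, np.argmax(eigvals)]

rng = np.random.default_rng(3)
n = 200
# Synthetic variables (assumption): 'big' has a huge scale and is unrelated
# to 'a' and 'b', which are strongly correlated with each other.
big = 1000.0 * rng.normal(size=n)
a = rng.normal(size=n)
b = 0.8 * a + 0.6 * rng.normal(size=n)
X = np.column_stack([big, a, b])

cov_pc1 = first_component(np.cov(X, rowvar=False))
corr_pc1 = first_component(np.corrcoef(X, rowvar=False))

print("first PC, covariance matrix: ", np.round(cov_pc1, 3))
print("first PC, correlation matrix:", np.round(corr_pc1, 3))
```

Comparing the two printed vectors shows which variables carry the weight of the first component in each case.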
Assignment 2: Principal components regression and partial
least squares regression
The Excel file “tecator.xls” contains the results of a study that investigated whether a
near-infrared absorbance spectrum can be used to predict the fat content of samples of
meat. For each meat sample the data consist of a 100-channel spectrum of absorbance
records and the levels of moisture (water), fat and protein. The absorbance is -log10 of
the transmittance measured by the spectrometer. The moisture, fat and protein levels are
determined by analytical chemistry.
The worksheet “data” contains data from 215 samples of finely chopped meat. Your task
is to establish PCR and PLS models in which the fat content is regarded as target and the
absorbance levels recorded in the 100 channels are regarded as explanatory variables.
Run proc PLS for a PLS regression and a PCR analysis
Import the worksheet ‘data’ in ‘tecator.xls’ to SAS. Open the log window of SAS and
check that the file has been successfully imported.
Open the Editor window and write SAS code in which proc PLS first performs a PLS
regression and then a PCR analysis. Use cross-validation for model selection, splitting the
entire data set into two parts. Compute the Root ASE for the test set.
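To make clear what the two methods compute, the following Python/NumPy sketch fits PCR and single-response PLS (NIPALS-style) on one half of a data set and evaluates them on the other half. It is not SAS code and does not reproduce proc PLS. The spectra are synthetic, the function names are hypothetical, and Root ASE is assumed to mean the square root of the average squared error.

```python
import numpy as np

def root_ase(y, y_hat):
    """Square root of the average squared error (assumed meaning of Root ASE)."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def fit_pcr(X, y, k):
    """Regress centred y on the first k principal component scores of X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T                                    # first k PC directions
    gamma, *_ = np.linalg.lstsq(Xc @ V, y - y_mean, rcond=None)
    beta = V @ gamma                                # back to channel space
    return beta, y_mean - x_mean @ beta

def fit_pls(X, y, k):
    """PLS regression for one response with k components (NIPALS-style)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yc = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(k):
        w = Xk.T @ yc                  # direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                     # scores
        tt = t @ t
        p = Xk.T @ t / tt              # X loadings
        q.append(yc @ t / tt)          # y loading
        Xk = Xk - np.outer(t, p)       # deflate X
        W.append(w)
        P.append(p)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y_mean - x_mean @ beta

rng = np.random.default_rng(4)
n, p = 215, 100
grid = np.linspace(0.0, 1.0, p)
# Synthetic spectra (assumption): a weak fat-related peak, a strong
# nuisance pattern unrelated to fat, and measurement noise.
fat = rng.uniform(5.0, 45.0, size=n)
nuisance = rng.normal(size=n)
X = (np.outer(fat, 0.01 * np.exp(-((grid - 0.6) / 0.1) ** 2))
     + np.outer(nuisance, np.sin(2 * np.pi * grid))
     + 0.01 * rng.normal(size=(n, p)))

# Split the data set into two parts: one for fitting, one for testing.
perm = rng.permutation(n)
train, test = perm[:108], perm[108:]

results = {}
for k in range(1, 6):
    row = {}
    for name, fit in (("PCR", fit_pcr), ("PLS", fit_pls)):
        beta, intercept = fit(X[train], fat[train], k)
        row[name] = root_ase(fat[test], X[test] @ beta + intercept)
    results[k] = row
    print(k, row)
```

The printed table gives the test-set Root ASE for each number of components, which is the kind of comparison the hand-in asks for.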
Run Enterprise Miner for an ordinary least squares regression
Run the Regression node in Enterprise Miner to undertake an ordinary least squares
regression using the same training and test sets as in the previous tasks. Use forward,
backward and stepwise regression for model selection and note the Root ASE for the test
set.
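The idea of forward selection can be sketched in a few lines of Python/NumPy. This is a simplified greedy variant that adds, at each step, the variable giving the lowest training sum of squared errors. It is not the algorithm implemented in Enterprise Miner, which uses its own entry and stopping criteria. The data and function names are hypothetical.

```python
import numpy as np

def ols_fit(X, y):
    """Least squares coefficients; element 0 is the intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def ols_predict(coef, X):
    return coef[0] + X @ coef[1:]

def forward_select(X, y, max_vars):
    """Greedy forward selection on the training sum of squared errors."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(max_vars):
        def sse(j):
            cols = selected + [j]
            coef = ols_fit(X[:, cols], y)
            return np.sum((y - ols_predict(coef, X[:, cols])) ** 2)
        best = min(remaining, key=sse)   # candidate that lowers the SSE most
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(5)
# Synthetic data (assumption): only columns 2 and 7 influence the target.
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7] + 0.1 * rng.normal(size=200)
train, test = np.arange(100), np.arange(100, 200)

chosen = forward_select(X[train], y[train], max_vars=2)
coef = ols_fit(X[train][:, chosen], y[train])
test_error = float(np.sqrt(np.mean(
    (y[test] - ols_predict(coef, X[test][:, chosen])) ** 2)))
print("selected columns:", chosen, "test Root ASE:", round(test_error, 3))
```

Backward and stepwise selection follow the same pattern, starting from the full model or allowing variables to be removed again.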
Compare and comment on the results obtained for the different regression models. Which
model would you prefer?
To hand in
Assignment 1: The score plot and your interpretation of that plot. Your explanation of
why the covariance matrix and the correlation matrix produce different eigenvectors and
eigenvalues.
Assignment 2: A table of Root ASE values of your models and your interpretation of the
obtained values.