Multivariate Statistics
Thomas Asendorf, Steffen Unkel
Study sheet 7
Summer term 2017
Date: 9 June 2017
Exercise 1:
(a) Let X and Y be jointly normally distributed and uncorrelated random variables. Are
X and Y independent? Justify your answer!
(b) Now suppose that X and Y are each marginally normally distributed but not jointly
normally distributed, and that X and Y are uncorrelated. Can you say whether X and Y
are independent? Justify your answer!
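Hint: one way to explore (b) empirically is the classic construction Y = S·X, with S a
random sign drawn independently of X; the sketch below assumes this particular
counterexample and is only illustrative.
# Possible counterexample for (b): Y = S * X with a random sign S
# independent of X. X and Y are each N(0,1) and uncorrelated,
# but |Y| = |X|, so they are not independent.
set.seed(42)
n <- 1e5
x <- rnorm(n)
s <- sample(c(-1, 1), n, replace = TRUE)
y <- s * x
cor(x, y)            # close to 0: uncorrelated
cor(abs(x), abs(y))  # exactly 1: strongly dependent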
Exercise 2:
In the data frame Countries, which is part of the R package BiplotGUI, eight variables
are measured on 15 countries. An R sketch for the non-interactive parts is given at the
end of this exercise.
(a) Become acquainted with the Countries data set.
(b) Initialise the graphical user interface (GUI) of the BiplotGUI package with the
Countries data.
(c) Produce a PCA biplot of the centered and scaled Countries data and save the obtained
PCA biplot as a pdf file.
(d) Interpret the point (sample) and axis predictivities of the PCA biplot by means of the
diagrams in the diagnostic tabs “Points” and “Axes”.
(e) How good is the quality of fit obtained by the two-dimensional PCA approximation?
(f) Find the adequacy of the representation for each variable.
(g) Use the diagnostic tab “Export” to find the relative absolute errors of each sample point
on the variable GDP and display them in the R console. Calculate the mean relative
absolute error for the variable GDP.
(h) Relate the point for China to its original variable values through the axes. Hints: By
right-clicking inside the biplot and selecting “Predict points closest to cursor positions”
from the pop-up menu, an array of orthogonally projecting lines emanates from the
sample point closest to the cursor as it moves. The predictions are also given numerically
in the diagnostic tab “Predictions”.
(i) Create a data frame with China being removed from the rows of Countries. Construct
a PCA biplot of the smaller (centered and scaled) data set and compare it to the PCA
biplot obtained in (c).
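A minimal R sketch for parts (a), (b) and (i); the remaining parts are carried out
interactively inside the GUI, and the row label "China" is assumed:
library(BiplotGUI)
data(Countries)
str(Countries)       # (a) inspect the 15 x 8 data set
Biplots(Countries)   # (b) launch the GUI with the Countries data
# (i) remove China (assuming the row name is "China") and compare
Countries2 <- Countries[rownames(Countries) != "China", ]
Biplots(Countries2)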
Exercise 3:
To simplify the interpretation of principal components, rotation can be used with the
objective of making the rotated components as simple as possible to interpret. However,
after rotation, one or both of the defining properties of PCA, that is, the orthogonality of
the loading vectors and the uncorrelatedness of the component scores, are lost. Verify this
statement.
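As a numerical illustration of the statement, one can apply an orthogonal varimax
rotation to the first two PC loadings of any data set and inspect both properties;
USArrests is an arbitrary example choice:
# Rotate the first two PC loadings orthogonally and check both properties
pca <- prcomp(USArrests, scale. = TRUE)
L <- pca$rotation[, 1:2]                  # orthonormal loading vectors
R <- varimax(L)$rotmat                    # orthogonal rotation matrix
L_rot <- L %*% R
round(crossprod(L_rot), 10)               # identity: orthogonality is kept
scores_rot <- scale(USArrests) %*% L_rot
round(cor(scores_rot), 3)                 # nonzero off-diagonal: scores now correlated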
Exercise 4:
Prove the following theorem.¹ Let M and N be two matrices of sizes n × p and n × k,
respectively. Consider the constrained minimization problem (also known as a Procrustes
rotation problem):
\[
\hat{\mathbf{A}} \;=\; \operatorname*{arg\,min}_{\mathbf{A}} \; \bigl\| \mathbf{M} - \mathbf{N}\mathbf{A}^{\top} \bigr\|_F^2
\quad \text{subject to} \quad \mathbf{A}^{\top}\mathbf{A} = \mathbf{I}_k.
\]
Suppose the SVD of \(\mathbf{M}^{\top}\mathbf{N}\) is \(\mathbf{U}\mathbf{D}\mathbf{V}^{\top}\); then \(\hat{\mathbf{A}} = \mathbf{U}\mathbf{V}^{\top}\).
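Before attempting the proof, the claim can be sanity-checked numerically; the dimensions
below are arbitrary example choices, and the check is not a proof:
# Numerical check of the Procrustes solution
set.seed(1)
n <- 20; p <- 5; k <- 3
M <- matrix(rnorm(n * p), n, p)
N <- matrix(rnorm(n * k), n, k)
s <- svd(t(M) %*% N)                            # M'N = U D V'
A_hat <- s$u %*% t(s$v)                         # candidate minimizer, A'A = I_k
obj <- function(A) sum((M - N %*% t(A))^2)      # squared Frobenius norm
A_rand <- qr.Q(qr(matrix(rnorm(p * k), p, k)))  # random A with A'A = I_k
obj(A_hat) <= obj(A_rand)                       # should be TRUE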
Exercise 5:
Recall the microarray data NCI60, which are contained in the R package ISLR and which
we already analyzed by means of PCA in the previous class. In this exercise we will perform
sparse PCA on this high-dimensional data set.
(a) Normalize the data NCI60$data so that the variables have mean zero and standard
deviation one and store the normalized data in an object sd.data.
(b) Perform sparse PCA by means of the function arrayspc() from the R package
elasticnet to find the leading sparse component for sd.data.
Hint: arrayspc(x,K=1,para) performs sparse PCA on a microarray matrix x with K
being the number of components and para being a vector of length K of lasso penalties.
(c) By means of the R code shown below, a series of sparse PCA solutions is obtained for
sd.data. Try to understand this code and run it.
library(ISLR)
library(elasticnet)

nci.labs <- NCI60$labs        # cancer type of each cell line
nci.data <- NCI60$data        # 64 x 6830 gene expression matrix
sd.data  <- scale(nci.data)   # center and scale the variables

# Grid of lasso penalties: larger values give sparser loadings
lasso <- c(0, 0.1, 1.0, 5.0, 10.0, 50, 100, 500, 1000, 1500)

spca_pev <- c()   # percentage of explained variance per solution
nnzero   <- c()   # number of nonzero loadings per solution
for (i in seq_along(lasso)) {
  prsparse.out <- arrayspc(sd.data, K = 1, para = lasso[i])
  spca_pev <- c(spca_pev, prsparse.out$pev)
  nnzero   <- c(nnzero, sum(prsparse.out$loadings != 0))
}
¹ This theorem corresponds to Theorem 4 in Zou, H., Hastie, T. and Tibshirani, R. (2006): Sparse principal
component analysis, Journal of Computational and Graphical Statistics, Vol. 15, pp. 265–286.
Print the objects spca_pev and nnzero and create a plot of the percentage of explained
variance against the number of nonzero loadings for the series of sparse leading-component
solutions obtained (see the sketch below).
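One way the requested plot could be drawn, assuming the loop above has been run:
# Explained variance against sparsity of the leading component
plot(nnzero, spca_pev, type = "b",
     xlab = "Number of nonzero loadings",
     ylab = "Percentage of explained variance")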
(d) Recall from the previous class that the leading PC explained 11.4% of the total
variance in the data. Which sparse solution adequately reconstructs the leading PC with
an affordable loss of explained variance? Use the loadings of your sparse solution to
compute the component scores of the leading sparse component on the n = 64 cell lines
for sd.data (see the sketch below).
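A sketch for the score computation in (d); the penalty value 50 is a placeholder to be
replaced by the solution you chose from the plot:
fit <- arrayspc(sd.data, K = 1, para = 50)  # placeholder penalty
scores <- sd.data %*% fit$loadings          # one score per cell line (n = 64)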
(e) Which of the 64 scores obtained in (d) correspond to which cancer type?
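For (e), the scores can be paired with the cancer types recorded in nci.labs, for example:
# Pair each score with its cancer type label
data.frame(type = nci.labs, score = round(as.vector(scores), 2))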