Multivariate Statistics
Thomas Asendorf, Steffen Unkel
Study sheet 7
Summer term 2017
Date: 9 June 2017
Exercise 1:
(a) Let X and Y be jointly normally distributed and uncorrelated random variables. Are
X and Y independent? Justify your answer!
(b) Now suppose that X and Y are each marginally normally distributed but not jointly
normally distributed, and that X and Y are uncorrelated. Can you say whether X and Y
are independent? Justify your answer!
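Hint: one way to explore (b) empirically is the classic construction Y = S·X, with S a
random sign drawn independently of X; the sketch below assumes this particular
counterexample and is only illustrative.
# Possible counterexample for (b): Y = S * X with a random sign S
# independent of X. X and Y are each N(0,1) and uncorrelated,
# but |Y| = |X|, so they are not independent.
set.seed(42)
n <- 1e5
x <- rnorm(n)
s <- sample(c(-1, 1), n, replace = TRUE)
y <- s * x
cor(x, y)            # close to 0: uncorrelated
cor(abs(x), abs(y))  # exactly 1: strongly dependent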
Exercise 2:
In the data frame Countries, which is part of the R package BiplotGUI, eight variables
are measured on 15 countries. An R sketch for the non-interactive parts is given at the
end of this exercise.
(a) Become acquainted with the Countries data set.
(b) Initialise the graphical user interface (GUI) of the BiplotGUI package with the
Countries data.
(c) Produce a PCA biplot of the centered and scaled Countries data and save the obtained
PCA biplot as a pdf file.
(d) Interpret the point (sample) and axis predictivities of the PCA biplot by means of the
diagrams in the diagnostic tabs “Points” and “Axes”.
(e) How good is the quality of fit obtained by the two-dimensional PCA approximation?
(f) Find the adequacy of the representation for each variable.
(g) Use the diagnostic tab “Export” to find the relative absolute errors of each sample point
on the variable GDP and display them in the R console. Calculate the mean relative
absolute error for the variable GDP.
(h) Relate the point for China to its original variable values through the axes. Hints: By
right-clicking inside the biplot and selecting “Predict points closest to cursor positions”
from the pop-up menu, an array of orthogonally projecting lines emanates from the
sample point closest to the cursor as it moves. The predictions are also given numerically
in the diagnostic tab “Predictions”.
(i) Create a data frame with China being removed from the rows of Countries. Construct
a PCA biplot of the smaller (centered and scaled) data set and compare it to the PCA
biplot obtained in (c).
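A minimal R sketch for parts (a), (b) and (i); the remaining parts are carried out
interactively inside the GUI, and the row label "China" is assumed:
library(BiplotGUI)
data(Countries)
str(Countries)       # (a) inspect the 15 x 8 data set
Biplots(Countries)   # (b) launch the GUI with the Countries data
# (i) remove China (assuming the row name is "China") and compare
Countries2 <- Countries[rownames(Countries) != "China", ]
Biplots(Countries2)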
Exercise 3:
To simplify the interpretation of principal components, rotation can be used with the
objective of making the rotated components as simple as possible to interpret. However,
after rotation, one or both of the defining properties of PCA, that is, the orthogonality of
the loading vectors and the uncorrelatedness of the component scores, are lost. Verify this
statement.
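As a numerical illustration of the statement, one can apply an orthogonal varimax
rotation to the first two PC loadings of any data set and inspect both properties;
USArrests is an arbitrary example choice:
# Rotate the first two PC loadings orthogonally and check both properties
pca <- prcomp(USArrests, scale. = TRUE)
L <- pca$rotation[, 1:2]                  # orthonormal loading vectors
R <- varimax(L)$rotmat                    # orthogonal rotation matrix
L_rot <- L %*% R
round(crossprod(L_rot), 10)               # identity: orthogonality is kept
scores_rot <- scale(USArrests) %*% L_rot
round(cor(scores_rot), 3)                 # nonzero off-diagonal: scores now correlated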
Exercise 4:
Prove the following theorem.¹ Let M and N be two matrices of sizes n × p and n × k,
respectively. Consider the constrained minimization problem (also known as a Procrustes
rotation problem):
\[
\hat{\mathbf{A}} \;=\; \operatorname*{arg\,min}_{\mathbf{A}} \; \bigl\| \mathbf{M} - \mathbf{N}\mathbf{A}^{\top} \bigr\|_F^2
\quad \text{subject to} \quad \mathbf{A}^{\top}\mathbf{A} = \mathbf{I}_k.
\]
Suppose the SVD of \(\mathbf{M}^{\top}\mathbf{N}\) is \(\mathbf{U}\mathbf{D}\mathbf{V}^{\top}\); then \(\hat{\mathbf{A}} = \mathbf{U}\mathbf{V}^{\top}\).
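Before attempting the proof, the claim can be sanity-checked numerically; the dimensions
below are arbitrary example choices, and the check is not a proof:
# Numerical check of the Procrustes solution
set.seed(1)
n <- 20; p <- 5; k <- 3
M <- matrix(rnorm(n * p), n, p)
N <- matrix(rnorm(n * k), n, k)
s <- svd(t(M) %*% N)                            # M'N = U D V'
A_hat <- s$u %*% t(s$v)                         # candidate minimizer, A'A = I_k
obj <- function(A) sum((M - N %*% t(A))^2)      # squared Frobenius norm
A_rand <- qr.Q(qr(matrix(rnorm(p * k), p, k)))  # random A with A'A = I_k
obj(A_hat) <= obj(A_rand)                       # should be TRUE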
Exercise 5:
Recall the microarray data NCI60, which are contained in the R package ISLR and which
we already analyzed by means of PCA in the previous class. In this exercise we will perform
sparse PCA on this high-dimensional data set.
(a) Normalize the data NCI60$data so that the variables have mean zero and standard
deviation one and store the normalized data in an object sd.data.
(b) Perform sparse PCA by means of the function arrayspc() from the R package
elasticnet to find the leading sparse component for sd.data.
Hint: arrayspc(x,K=1,para) performs sparse PCA on a microarray matrix x with K
being the number of components and para being a vector of length K of lasso penalties.
(c) By means of the R code shown below, a series of sparse PCA solutions is obtained for
sd.data. Try to understand this code and run it.
library(ISLR)
library(elasticnet)

nci.labs <- NCI60$labs        # cancer type of each cell line
nci.data <- NCI60$data        # 64 x 6830 gene expression matrix
sd.data  <- scale(nci.data)   # center and scale the variables

# Grid of lasso penalties: larger values give sparser loadings
lasso <- c(0, 0.1, 1.0, 5.0, 10.0, 50, 100, 500, 1000, 1500)

spca_pev <- c()   # percentage of explained variance per solution
nnzero   <- c()   # number of nonzero loadings per solution
for (i in seq_along(lasso)) {
  prsparse.out <- arrayspc(sd.data, K = 1, para = lasso[i])
  spca_pev <- c(spca_pev, prsparse.out$pev)
  nnzero   <- c(nnzero, sum(prsparse.out$loadings != 0))
}
¹ This theorem corresponds to Theorem 4 in Zou, H., Hastie, T. and Tibshirani, R. (2006): Sparse principal
component analysis, Journal of Computational and Graphical Statistics, Vol. 15, pp. 265–286.
Print the objects spca_pev and nnzero and create a plot of the percentage of explained
variance against the number of nonzero loadings for the series of sparse leading-component
solutions obtained (see the sketch below).
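One way the requested plot could be drawn, assuming the loop above has been run:
# Explained variance against sparsity of the leading component
plot(nnzero, spca_pev, type = "b",
     xlab = "Number of nonzero loadings",
     ylab = "Percentage of explained variance")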
(d) Recall from the previous class that the leading PC explained 11.4% of the total
variance in the data. Which sparse solution adequately reconstructs the leading PC with
an affordable loss of explained variance? Use the loadings of your sparse solution to
compute the component scores of the leading sparse component on the n = 64 cell lines
for sd.data (see the sketch below).
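A sketch for the score computation in (d); the penalty value 50 is a placeholder to be
replaced by the solution you chose from the plot:
fit <- arrayspc(sd.data, K = 1, para = 50)  # placeholder penalty
scores <- sd.data %*% fit$loadings          # one score per cell line (n = 64)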
(e) Which of the 64 scores obtained in (d) correspond to which cancer type?
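For (e), the scores can be paired with the cancer types recorded in nci.labs, for example:
# Pair each score with its cancer type label
data.frame(type = nci.labs, score = round(as.vector(scores), 2))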