Statistics using R (spring of 2017)

Computer lab: Multivariate Plots, Principal Component Analysis and Discriminant Analysis

April 5, 2017

Theory

Multivariate plots (heat plots)

Consider a data matrix X of real numbers with n rows and k columns. We view the rows X1, ..., Xn as independent samples from a multivariate distribution with mean vector µ and covariance matrix Σ. Can we visualize X in some easy way? One possibility is to use heat plots (also known as heat maps). These are constructed as follows: think of a rectangular matrix in which the column- (or row-) standardized observation numbers are substituted by colored squares on a continuous color scale, e.g. intensely green for very negative values, intensely red for very positive values, and some suitable interpolation for the numbers in between. By sorting both rows and columns in various ways, one can get a convenient visual overview of all the vectors. Observe that the data structure and information are kept fully intact after such sorting. In genomics such plots were introduced by Eisen et al. (1998), and in this field they are often called Eisen plots after him. An interesting historical sketch is given in Wilkinson and Friendly (2009).

One option for the sorting is to use hierarchical clustering and to display the result as trees, in which observation vectors that are similar are placed near each other in the row sorting, and highly correlated column patterns are likewise sorted together. There are many options available for hierarchical clustering. It can be performed in a sequential manner from the root of a tree (divisive methods), or it can start by merging leaves of the tree (agglomerative methods). The measures of similarity or distance between the objects, and between the groups of objects that are clustered, can also be varied in many different ways. A quite pedagogical account of traditional clustering terminology and ideas is given by Jain et al. (1999).
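As one possible illustration of the agglomerative approach (not part of the lab tasks), the following base R sketch clusters the rows of a small simulated matrix. The simulated data, the Euclidean distance, the average linkage and the choice of three groups are arbitrary choices made only for this example.

    ## A minimal sketch of agglomerative hierarchical clustering in base R.
    ## All choices below (data, distance, linkage, number of groups) are illustrative.
    set.seed(1)
    M  <- matrix(rnorm(50 * 4), nrow = 50, ncol = 4)   # small simulated data matrix
    d  <- dist(M, method = "euclidean")                # pairwise distances between rows
    hc <- hclust(d, method = "average")                # agglomerative clustering, average linkage
    plot(hc)                                           # display the resulting tree
    groups <- cutree(hc, k = 3)                        # cut the tree into three groups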
Principal component analysis (PCA)

Another choice when one wishes to detect subgroups in multidimensional data is to use some dimension reduction technique for the observation vectors. The most common approach is based on a suitable twist (a kind of generalized rotation operation) of the original coordinate system. The method is referred to as principal component analysis and was introduced by Karl Pearson in 1901.

Suppose that X is a matrix with n rows, corresponding to different multivariate observations, and k columns, corresponding to different coordinate values for each observation. The PCA operation then gives a new matrix Y with n rows and k columns. The row vectors Y1, ..., Yn have coordinate values calculated as certain linear combinations of the original vectors X1, ..., Xn,

    Y_{ij} = A_{1j} X_{i1} + A_{2j} X_{i2} + ... + A_{kj} X_{ik},    1 ≤ i ≤ n, 1 ≤ j ≤ k.

Here, the coefficients A_{ij} of the k by k matrix A are restricted so that

    A_{1j}^2 + A_{2j}^2 + ... + A_{kj}^2 = 1,    1 ≤ j ≤ k.

This means that the length of each column vector of the coefficient matrix is equal to 1. As a further restriction, the columns of A are taken to be orthogonal, which may be expressed as

    A_{1j} A_{1g} + A_{2j} A_{2g} + ... + A_{kj} A_{kg} = 0,    j ≠ g.

It turns out that it is always possible to choose such normalized and orthogonal column vectors in an optimal way, so that

1. The columns of Y are empirically uncorrelated.

2. The first column of Y has an optimally large empirical variance. This means that the values in this column encode the coordinates of the data in the direction of most variation.

Further, the second column of Y contains the coordinates of the data in the direction of second most variation (orthogonal to the first), and so on for the remaining columns.

The columns of A may be computed using methods from linear algebra. One approach is to use a singular value decomposition of X. In practice, the calculation is done by a computer program. Once the results are available, we can exploit the fact that most of the variation in the data is captured by the first columns of the matrix Y. Specifically, approximations Ỹ1, ..., Ỹn of the row vectors Y1, ..., Yn can be obtained by simply dropping the last components (pretending that they are equal to their empirical means). Further, we can use the approximate matrix Ỹ to replace the original observations X1, ..., Xn by approximations X̃1, ..., X̃n. This is done by multiplying Ỹ with the transpose of A. What has been achieved is usually referred to as a dimensionality reduction. Performing such a reduction can be very useful when one wants to find out which components are the most important (i.e., explain most of the variation).
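To make the description above concrete, here is a small base R sketch of the PCA transformation and the dimensionality reduction. In the output of prcomp, the matrix A appears as rotation and the transformed data Y as x; the toy data and the choice q = 2 retained components are for illustration only.

    ## A minimal sketch of PCA and rank-q reconstruction with prcomp (base R).
    set.seed(2)
    M  <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)  # toy data: n = 100, k = 5
    pc <- prcomp(M)                                     # centers the columns by default
    A  <- pc$rotation                                   # k-by-k matrix of loadings
    Y  <- pc$x                                          # transformed data (principal components)

    q      <- 2                                         # keep only the first q components
    M.appr <- Y[, 1:q] %*% t(A[, 1:q])                  # approximate, centered observations
    M.appr <- sweep(M.appr, 2, pc$center, "+")          # add back the column means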
Theoretical variant

The whole data transformation above is determined by the empirical covariance matrix of the original data matrix X. Thus, in the limit of an infinite number of observations n, the theoretical covariance matrix determines the transformation. If one wishes to study properties of principal component methods in large samples, this can be utilized.

Notation and plots

There is some general terminology related to principal component analysis. The A_{ij}-coefficients are called loadings, and they are important to study because their sizes inform about which variables are captured by which new coordinates. Their signs also help in interpreting likely explanations for extreme observations on the different principal components. The columns of A are usually referred to as principal component axes. It is very common to plot the first two column vectors of the transformed data Y in a scatterplot. Often, one uses so-called Trellis plots, systematic two-dimensional plots for illustration of pairwise combinations of the first 3-4 principal components (see, e.g., Ihaka (2007)). In such plots the empirical means are often subtracted.

A comment on PCA and regression

In regression modelling with very high dimensional covariate data and a real-valued response, say, it is common to use data reduction and apply models where the original covariates are substituted with a set of the largest principal components. This is done in order to avoid overfitting and improve prediction capacity. Whether this is a good idea or not partly depends on the covariance structure, and partly on the regression modelling context. For example, if you think of the problem of "finding a needle in a haystack" in the form of a single important explanatory variable, it might be a bad idea to blur the picture by twisting coordinates.

An alternative that is commonly used in some sciences is to compromise between the internal variability pattern of the x-variables and the explanation of the response: among all possible orthogonal twists of the coordinate system of the x-variables, one maximizes the covariance between the first component and the response y. In a second step, the components of the twisted explanatory variables are used in linear regression modelling. This can be generalized to multivariate response variables Y, and the whole area is called PLS methods (partial least squares regression). These compromises lead to better predictions, when generalized to high dimensions, compared to the traditional method, canonical correlation analysis, which optimizes correlation instead of covariance.

Let us also remark that, in recent years, methods that try to find a small number of explanatory regression variables among thousands of potential ones have seen an explosion of use and new suggestions, and have become extremely popular in genomic screening assays. Typically, these regressions penalize complex explanations in one way or another. One may start learning about the area by reading about LASSO methods. If PLS is used on correlation matrices, it looks on the surface as if it does the opposite of what LASSO aims at, namely finding whole groups of correlated variables that together explain the response. However, observe that one can think of advantages also in such approaches. For example, LASSO methods might miss alternative explanations that might be more natural to find from patterns of loadings in PLS components. To formulate it drastically, the power of screening methods in genomics is extremely dependent on a deeper biological understanding of the generic biological function possibilities of the basic measurement components, an understanding that statisticians rarely have. Thus, research in multidisciplinary teams is necessary not only in application steps, but also when deciding on which methods to use and for the development of appropriate statistical methods.
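As a small illustration of the principal-components-in-regression idea mentioned above (this is not PLS itself, which would require an additional package), the following base R sketch replaces a set of covariates by their first few principal components and fits an ordinary linear model to them. The simulated data and the number of retained components are arbitrary choices.

    ## A minimal sketch of principal component regression in base R.
    set.seed(3)
    n <- 200; k <- 10
    X <- matrix(rnorm(n * k), n, k)                 # toy high-dimensional covariates
    y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)           # toy response
    pc  <- prcomp(X, scale. = TRUE)                 # PCA on the correlation matrix
    Z   <- pc$x[, 1:3]                              # keep the three largest components
    fit <- lm(y ~ Z)                                # regression on the components
    summary(fit)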
Discriminant analysis

Consider the situation in which we have two continuously distributed multidimensional random variables with densities f1 and f2. Imagine that you have an experiment in which you first toss a biased coin to choose one of the two distributions, with probabilities p1 and p2 = 1 − p1. Then, you draw an observation x from the chosen distribution. You are then asked to guess whether the observation comes from the first or from the second distribution. It turns out that one may use a variant of Bayes' formula to get the a posteriori odds,

    P(First dist | x) / P(Second dist | x) = p1 f1(x) / (p2 f2(x)).

Suppose now, more specifically, that you have two multivariate normally distributed random variables with the same known covariance matrix Σ and two different mean vectors, µ1 and µ2. It then turns out that the logarithm of the odds is a linear function,

    ln[ P(First dist | x) / P(Second dist | x) ] = α + β1 x1 + β2 x2 + ... + βk xk.

This is the typical expression we find in logistic regression models. The coefficients in this linear expression are determined by the two mean vectors, the common covariance matrix and the probability p1. The rational decision rule (for an optimal guess of the distribution) can be derived as a comparison of the log-odds with a suitable threshold c: you simply decide that the observation came from distribution 1 if and only if the log-odds for the observation x is larger than the threshold c.

Now, suppose that you know neither the means nor the covariance matrix, but that you have two samples, one from each of the two distributions. Then one can simply estimate the unknown parameters with empirical estimates and plug them in as if they were true, and out comes an estimate of the logistic regression function above. Or, if both samples are large, one may estimate the β-coefficients by adopting a logistic regression GLM procedure as if the group belongings were independent. In this case a separate argument has to be made for the α parameter, since it is essentially determined by the ratio of the sample sizes of the two training data sets used, which need not equal the a priori odds p1/p2.

Discriminant analysis is an example of a supervised learning procedure. Such procedures are increasingly utilized in all kinds of practical applications in machine learning and data mining contexts. The techniques used in this area are often extensions, with more flexibility, of ideas from regression and multivariate statistics (neural networks, support vector machines and regression tree methods are just a few examples). However, the most natural extensions of discriminant analysis concern discrimination between more than two groups, say r groups with different mean vectors and a common covariance matrix, or extensions to models where the multivariate covariance matrices are allowed to differ between the groups. The first generalization is quite straightforward, while the second leads to more complex log-odds functions with second order interaction terms and quite strange decision patterns for rare outlier observations. A classical survey of discriminant analysis that discusses these questions is (Gnanadesikan, 1989).
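The plug-in rule described above can be written out directly in base R. The following sketch estimates the two mean vectors and a pooled covariance matrix from two toy samples and evaluates the estimated log-odds for a new observation. Here the a priori odds are taken from the sample sizes and the threshold is c = 0; both are illustrative choices, as are all object names.

    ## A minimal sketch of the plug-in linear discriminant rule (base R only).
    set.seed(4)
    n1 <- 60; n2 <- 40
    X1 <- matrix(rnorm(n1 * 2), n1, 2)                      # sample from distribution 1
    X2 <- matrix(rnorm(n2 * 2, mean = 1), n2, 2)            # sample from distribution 2

    m1 <- colMeans(X1); m2 <- colMeans(X2)                  # estimated mean vectors
    S  <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)   # pooled covariance

    beta  <- solve(S, m1 - m2)                              # slope coefficients of the log-odds
    alpha <- log(n1 / n2) - 0.5 * sum((m1 + m2) * beta)     # intercept, priors from sample sizes

    x       <- c(0.5, 0.2)                                  # a new observation to classify
    logodds <- alpha + sum(beta * x)
    if (logodds > 0) "distribution 1" else "distribution 2" # compare with threshold c = 0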
Tasks

Multivariate plots

In this part of the lab, we will use the data set golub to show how the R command heatmap may be used to find and illustrate patterns in data. The data set is associated with a study on cancer classification by gene expression analysis done by Golub et al. (1999). The data set is part of the package multtest and must be loaded via Bioconductor. Run the following code in your R session to do this:

    source("http://bioconductor.org/biocLite.R")
    biocLite("multtest")
    library("multtest")
    data(golub)

You should now have access to a matrix golub, which contains gene expression levels for 38 tumor mRNA samples. Rows correspond to genes (3051 genes) and columns to mRNA samples. You can run help(golub) to find out more about the data set. Note in particular that there are two types of tumors, 27 acute lymphoblastic leukemia (ALL) cases and 11 acute myeloid leukemia (AML) cases, where the ALL cases correspond to the first 27 columns of the matrix and the AML cases to the rest.

1. First, plot the data as it is given, without any attempt to impose a cluster structure for the rows or columns:

    colRamp <- colorRampPalette(c("green", "black", "red"), space="rgb")(64)
    heatmap(golub, col = colRamp, Rowv = NA, Colv = NA)

Looking at the plot, there should be no apparent patterns, even though we know that the last eleven columns (28-38) correspond to AML cases. Note the interpretation of the color palette introduced for the gene expression levels: green = low, black = medium, red = high.

2. Next, redo the plot using the default behaviour of heatmap, which is to create clusters of rows and columns using a Euclidean distance function and then reorder them using the mean values of the rows and columns.

    heatmap(golub, col = colRamp)

You should now be able to discern some structure in the gene data. Although the grouping is not perfect, in the sense that it does not completely separate the ALL and AML cases, you should be able to find two major AML groups by looking at the column tree structure. One of them contains 5 cases, and the other 6 cases.

3. Golub et al. (1999) managed to find a set of genes strongly correlated with the class distinction (ALL vs. AML). Download and source the file genepred.R. This will give you access to a vector pred.genes which contains the row indices of 25 such predictive genes. Make a heat plot using only this subset by running

    heatmap(golub[pred.genes,], col = colRamp)

What do you think about the results? Is there a clear cluster which contains all 11 AML cases now?

4. The function heatmap is quite flexible. For example, it is possible to change the distance measure used for the rows and columns (option distfun) and the function used to compute the hierarchical clustering (option hclustfun). Read the help page for heatmap and try to change one or both of these options. Then, make a custom heat plot using all or only the predictive genes with the new options. Can you find a set of options that improves the clustering, in the sense that all (or almost all) AML cases are assigned to a clearly identifiable group? (One possible combination of options is sketched right after this list.)
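As one possible starting point for task 4 (one of many combinations, and not necessarily the best one), the options could be changed as follows. The correlation-based distance and the Ward linkage are illustrative choices only.

    ## One possible way to change the distance and clustering options of heatmap.
    corDist   <- function(x) as.dist(1 - cor(t(x)))          # correlation-based distance between rows
    wardClust <- function(d) hclust(d, method = "ward.D2")   # Ward linkage
    heatmap(golub[pred.genes, ], col = colRamp,
            distfun = corDist, hclustfun = wardClust)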
Principal component analysis

Explorative uses to find patterns

Principal components are often used in order to explore potential differences between groups of data. As an example, we may think of a two-treatment situation in which systematic effects are added, in one of the treatment samples, to some of the variables. Such changes should typically result in correlations between the X-components that are affected, and at least if the effects are large, these components have a big chance of having high negative or positive loadings in the first principal components. If one plots each group with a different symbol in the scatterplots, or in the Trellis variants of those, one may often find that the observations in the groups occur in separate clusters.

However, the PCA procedure is not scale invariant, and it is often recommended that all the original X-components (the columns) are standardized before the analysis. This means that the analysis is done on the empirical correlation matrix of the X-variables. It then turns out that the most interesting and clear clustering patterns that only pertain to one of the X-variables are hard to find in the plots, while treatment effects that affect several of the X-variables still show up. The question of whether to do a PCA with or without scaling is very tricky. In the case when PCA is used in an exploratory situation and the variables are not obviously of different scale (like human weight in grams and length in meters), one can try PCA both with and without standardization. Further, one could also try some variance stabilizing transformation like logarithms or square roots for positive variables. In order to illustrate how this is done in practice, we want you to do the following:

Simulation illustrations

1. Simulate a data matrix X with standard normal entries, having n = 100 rows (number of multivariate observations) and k = 8 columns (number of coordinates of each observation) by running the code

    n <- 100
    k <- 8
    X <- matrix(data = rnorm(n * k, mean = 0, sd = 1), nrow = n, ncol = k)

In this situation, a principal component analysis will not yield any useful information because of the total symmetry and independence of the simulated numbers. Simulate X a couple of times and apply the R function prcomp to it. Use the functions summary and plot on the output of prcomp and interpret the results.

2. The default behaviour of prcomp is to use no scaling of the variables. This is the option scale = FALSE and corresponds to the use of the empirical variance. Redo the previous task using the empirical correlation instead by changing the option to scale = TRUE when calling prcomp. Can you find any change in the results? Is such a change even expected when we have independent standard normal data?

3. Imagine now that the first 50 and the last 50 observation vectors (rows) in X belong to different treatment groups, and that an effect of 5 should be added to the second component for all 50 observations in the second group. Make this change to the simulated data (a sketch of this manipulation is given at the end of this subsection) and verify using summary(prcomp(X)) that a large proportion of the variance is now captured by the first principal component (PC1). Next, plot the results for the two groups using a Trellis plot showing the three first principal components against each other. As a preparatory step, you first need to load the lattice package. The plot can then be constructed using the command splom (scatter plot matrix):

    library(lattice)
    Y <- prcomp(X)$x # Create matrix holding principal components as its columns
    splom(Y[,1:3], col = c(rep("blue", 50), rep("red", 50)))

Finally, redefine Y using the option scale = TRUE and make a new plot. Why are the results different?

4. Next, we use another systematic pattern of X, where the first, second and third columns are affected by an addition of 5/√3, −5/√3 and 5/√3, respectively (as before, the effect is only added to the last 50 rows). Observe that this can be viewed as an effect size similar to the one in the previous step. Again, perform PCAs both with and without scaling and construct plots for both cases. Why are the results different now (compared to the previous step)?

5. Now, let us make a completely different transformation of the data matrix. Construct a new data matrix from X by using the successive cumulative sums of its 8 columns (first column unchanged, second column redefined as the sum of the first two, third column redefined as the sum of the first three, etc.); see the sketch at the end of this subsection. This of course results in a covariance matrix where we no longer have independence, and in which the last components of the new data matrix have much larger variances. Repeat all the steps above for the new data matrix and comment. In the last step, with the addition of the three effects, normalize the effect sizes so that you use 5/√3, 5/√(3/2) and 5 for components 1, 2 and 3, respectively. Comment on your results.

It should be noted that PCA is not primarily designed for the situation considered in the simulation tasks above, where we know which object belongs to which group. PCA is instead mainly designed to make data reductions, and to do unsupervised searches for groups with different patterns in multivariate data. Observe that if we ignore the group membership in the scatter plots, it is much harder to detect any patterns. It is an interesting experience to combine the data reduction techniques of PCA with heat plots, but we do not really have time for that here.
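For reference, here is a sketch of the data manipulations needed in simulation tasks 3 and 5 above. The object names X.eff and X.cum are only suggestions.

    ## Task 3: add an effect of 5 to the second column for the last 50 rows.
    X.eff <- X
    X.eff[51:100, 2] <- X.eff[51:100, 2] + 5

    ## Task 5: successive cumulative sums of the columns, computed row by row.
    X.cum <- t(apply(X, 1, cumsum))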
Discriminant analysis

In this part, the practical application of discriminant analysis (DA) will be illustrated using the R function lda. This function is part of the package MASS, and hence the first thing you should do as a preparatory step is to run

    library(MASS)

Typically, the DA procedure consists of two steps. Firstly, the function lda is applied to a data set of an appropriate format. This is the training (or learning) part, where the parameters of the procedure are adapted so as to be able to discriminate between observations of data points from a finite number of different classes. Secondly, the output of lda (an R object) is passed as an argument to the function predict, together with a new data set of observations. The goal in the second step is to use the model fitted to the training data to predict the correct class membership of the new observations.

As a starting point for the tasks below, the following three functions are provided. These are also found in the file "da functions.R", available on the course home page. You may find it convenient to take this file as a starting point for the code you will be asked to implement.

    generate.lda.data <- function(n, p, mu, Sigma) {
      ## Draw random samples from a finite number of classes
      k <- length(p)
      class <- as.factor(sample(1:k, size = n, replace = TRUE, prob = p))
      ## Simulate multivariate normal observations conditionally on class membership
      d <- length(mu[[1]]) ## Dimension of observations
      X <- matrix(data = NA, nrow = n, ncol = d)
      for (i in 1:n)
        X[i,] <- mvrnorm(1, mu = mu[[class[i]]], Sigma = Sigma[[class[i]]])
      data.frame(X, class = class)
    }

    plot.lda.data <- function(train.data, predict.data, predict.class, cols) {
      ## Plot training data, and then add prediction data
      plot(train.data$X1, train.data$X2, pch = 1, col = cols[train.data$class])
      points(predict.data$X1, predict.data$X2, pch = 4, col = cols[predict.class])
    }

    test.lda <- function() {
      n <- 100
      p <- c(0.5, 0.5)
      mu1 <- c(-1, 0); mu2 <- c(1, 0); mu <- list(mu1, mu2)
      s1 <- 1; s2 <- 1; rho <- 0
      sigma <- matrix(c(s1^2, rho * s1 * s2, rho * s1 * s2, s2^2), nrow = 2)
      Sigma <- list(sigma, sigma)
      train.data <- generate.lda.data(n, p, mu, Sigma)
      predict.data <- generate.lda.data(n, p, mu, Sigma)
      lda.out <- lda(train.data[,1:2], grouping = train.data$class)
      predict.class <- predict(lda.out, predict.data[,1:2])$class
      plot.lda.data(train.data, predict.data, predict.class, cols = c("red", "blue"))
    }

The function generate.lda.data samples n multivariate observations from one of length(p) possible classes. The probability that each individual observation is sampled from class i is given by p[i]. Each class consists of multivariate normal random variables. The mean vector and covariance matrix defining the distribution for each class are specified by the i:th entries (mu[[i]] and Sigma[[i]]) of the lists mu and Sigma. The dimension of each observation equals the length of each mean vector mu[[i]]. The object returned by this function is a data frame where each row corresponds to one observation and where the last column gives the index of the group that the observation belongs to. Note that this function is used to generate both the training data and the prediction data.

The function plot.lda.data takes both the training data and the prediction data as its first two arguments. It also takes the classes predicted by the predict function as its third argument, and finally a vector of colors that maps to the different classes. The result of running this function is a plot which shows both the training data (circles) and the prediction data (crosses). Note that the color of the training data is the one obtained during generation, while the color for the predicted data is the one predicted by the DA algorithm (which need not equal the true class membership as determined when generating the prediction data using generate.lda.data).

The function test.lda shows how to specify the necessary arguments and then call the previous two functions in order to plot the results of the analysis. As given above, it may be interpreted as an experiment in which a fair coin is tossed 100 times. For each coin toss, if the result is tails, then a single random observation is taken from a bivariate normal distribution with mean µ1 = (−1, 0). If the result is heads, then a single random observation is taken from a bivariate normal distribution with mean µ2 = (1, 0). These 100 observations then constitute the training data, and lda is applied to it in order to fit the discrimination model. Then, another 100 tosses are made in exactly the same way in order to generate the prediction data.
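Several of the tasks below ask for the number or proportion of misclassified observations. Assuming that objects named predict.class and predict.data, as in the body of test.lda (or your own versions of them), are available in your workspace, a minimal sketch of that computation is:

    ## Count and proportion of misclassified predictions; predict.data$class holds
    ## the true classes and predict.class the classes returned by predict.
    misclassified <- predict.class != predict.data$class
    sum(misclassified)     # number of misclassified observations
    mean(misclassified)    # proportion misclassified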
1. Take a few minutes and try out what happens when you change the parameters in the function test.lda. What happens if you change mu1 and mu2? What happens if you change the correlation coefficient rho for the covariance matrix? Also, change the probabilities of the two classes and verify that the distribution of observations changes accordingly.

2. Since the plot produced by plot.lda.data only shows the predicted class (crosses) for the data used for prediction, and not the true class, it is not possible to see if any of the observations have been misclassified by the DA procedure. In order to remedy this, modify plot.lda.data so that the true class of the predicted data is also shown using a box with the correct color around the crosses (hint: use pch = 0 for boxes). To do this, it should be enough to add a single line to the code at the end of the function. Then, use your new function to estimate the number of observations in predict.data that are misclassified. In what follows, use your new plotting function, since it shows a superset of the information shown in the more basic version.

3. Suppose now that the class probabilities for the data to be classified are known to be very different from the ones used for the training data. Assume for example that the probabilities are (p1, p2) = (0.9, 0.1). Then it is still quite likely that a data point which is close to the second cluster will belong to the first. However, this aspect is not captured by the DA prediction unless these new probabilities are passed on as a parameter to the function predict. If this is not done, the empirical probabilities of the training data will be used also for prediction. Use ?predict.lda to find out how the call to predict must be modified in order to change the probabilities when performing prediction. Now, keeping the generation of the training data exactly as before, use (p1, p2) = (0.9, 0.1) during generation of the prediction data. Then, compare the performance of the predictions (by counting misclassifications) with and without passing these probabilities to the predict function. Can you see any improvement?

4. So far, we've only used two classes. However, the functions introduced for generating and plotting data support using more groups of bivariate normal distributions. Extend the test.lda function with more groups (say a total of 3 or 4) and try out some different mean vectors and covariance matrices. Do some plots to get a feeling for how the classifications (and misclassifications) depend on the mean vectors and covariance matrices of the groups. You may also want to change the allocation (group) probabilities and the total number of samples taken. Finally, extend the test.lda function so that it also computes the proportion of misclassified observations among the ones that are predicted.

5. In the final subtask, you will try out the lda function on a real data set. The data set used is called "iris". Load it by calling data(iris) and read about it using ?iris. The main change now is that each observation is made on four variables instead of just two. It is therefore more challenging to graphically present the results, but you could try to use what you've learned so far in the course and think about some appropriate way to do it. However, the main task is to check how well DA discriminates by computing the proportion of misclassified observations. Do this by using the first 25 observations in each class of species (setosa, versicolor and virginica) as training data, and the remaining 25 observations in each group as prediction data (it may be a good idea to first create two separate data frames from iris, one holding the training data and the other holding the prediction data; one possible way to construct this split is sketched right after this list).
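One possible way to construct the training/prediction split asked for in task 5, taking the first 25 observations of each species for training, is sketched below. The object names are suggestions only.

    ## Split iris into training data (first 25 of each species) and prediction data.
    data(iris)
    first25    <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species), head, 25))
    train.iris <- iris[first25, ]
    pred.iris  <- iris[-first25, ]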
References

Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95(25):14863–14868.

Gnanadesikan, R. (1989). Discriminant analysis and clustering: Panel on discriminant analysis, classification, and clustering. Statistical Science, 4(1):34–69. http://projecteuclid.org/download/pdf_1/euclid.ss/1177012666.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, New Series, 286(5439):531–537.

Ihaka, R. (2007). Trellis plots. https://www.stat.auckland.ac.nz/~ihaka/787/lectures-trellis.pdf.

Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3):265–323.

Wilkinson, L. and Friendly, M. (2009). The history of the cluster heat map. The American Statistician, 63(2):179–184.