Statistics using R (spring of 2017)
Computer lab:
Multivariate Plots, Principal Component Analysis and
Discriminant Analysis
April 5, 2017
Theory
Multivariate plots (heat plots)
Consider a data matrix X of real numbers with n rows and k columns. We view the rows
X_1, ..., X_n as independent samples from a multivariate distribution with mean vector µ and
covariance matrix Σ. Can we visualize X in some easy way? One possibility is to use heat plots
(also known as heat maps). These are constructed as follows: think of a rectangular matrix in
which the column- (or row-) standardized observation numbers are substituted by colored squares
on a continuous color scale, e.g. intensely green for very negative values, intensely red for
very positive values and some suitable interpolation for the numbers in between. By sorting
both rows and columns in various ways, one can get a convenient visual overview of all the
vectors. Observe that the data structure and information are kept fully intact after such sorting.
In genomics such plots were introduced by Eisen et al. (1998), and in this field they are often
called Eisen plots after him. An interesting historical sketch is given in (Wilkinson and Friendly,
2009).
One option for the sorting is to use hierarchical clustering and display the result as trees in
which observation vectors that are similar are displayed near each other in the row sorting, and
highly correlated column patterns are likewise sorted together.
There are many options available for hierarchical clustering. It can be performed in a sequential
manner from the root of a tree (divisive methods), or it can start by merging leaves of the
tree (agglomerative methods). The measures of similarity or distance between the objects and the
groups of objects that are clustered can be varied in many different ways. A quite pedagogical
account of traditional clustering terminology and ideas is given by Jain et al. (1999).
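As a small illustration (simulated data, illustrative object names), the following sketch produces such a plot with base R's heatmap function, which by default applies hierarchical clustering to both rows and columns:

set.seed(1)
X <- matrix(rnorm(40 * 10), nrow = 40, ncol = 10)   # 40 observations, 10 variables
X[1:20, 1:3] <- X[1:20, 1:3] + 2                    # a block effect to create some structure
heatmap(scale(X),                                   # column-standardized values
        col = colorRampPalette(c("green", "black", "red"))(64))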
Principal component analysis (PCA)
Another choice when one wishes to detect subgroups in multidimensional data is to use some
dimension reduction technique for the observation vectors. The most common approach is based
on a suitable twist (a kind of generalized rotation operation) of the original coordinate system.
The method is referred to as principal component analysis and was introduced by Karl Pearson
in 1901. Suppose that X is a matrix with n rows, corresponding to different multivariate
observations, and k columns, corresponding to different coordinate values for each observation.
The PCA operation then gives a new matrix Y with n rows and k columns. The row vectors
Y_1, ..., Y_n have coordinate values calculated as certain linear combinations of the coordinates
of the corresponding original vectors X_1, ..., X_n,

Y_{ij} = A_{1j} X_{i1} + A_{2j} X_{i2} + ... + A_{kj} X_{ik},    1 ≤ i ≤ n, 1 ≤ j ≤ k.

Here, the choice of the coefficients A_{ij} of the k × k matrix A is restricted so that

A_{1j}^2 + A_{2j}^2 + ... + A_{kj}^2 = 1,    1 ≤ j ≤ k.

This means that the length of each column vector of the coefficient matrix is equal to 1. As a
further restriction, the columns of A are taken to be orthogonal, which may be expressed as

A_{1j} A_{1g} + A_{2j} A_{2g} + ... + A_{kj} A_{kg} = 0,    j ≠ g.
It turns out that it is always possible to choose such normalized and orthogonal column vectors
in an optimal way, so that
1. The columns of Y are empirically uncorrelated.
2. The first column of Y has an optimally large empirical variance. This means that the
values in this column encode the coordinates of the data in the direction of most variation.
Further, the second column of Y contains the coordinates of the data in the direction of
second most variation (orthogonal to the first), and so on for the remaining columns.
The columns of A may be computed using methods from linear algebra. One approach is
to use a singular value decomposition of X. In practice, the calculation is done by a computer
program. Once the results are available, we can exploit the fact that most of the variation in the
data is captured by the first columns of the matrix Y . Specifically, approximations Ỹ1 , . . . , Ỹn of
the row vectors Y1 , . . . , Yn can be obtained by simply dropping the last components (pretending
that they are equal to their empirical means). Further, we can use the approximate matrix Ỹ
to replace the original observations X1 , . . . , Xn by approximations X̃1 , . . . , X̃n . This is done by
multiplying Ỹ with the transpose of A. What has been achieved then is usually referred to as
a dimensionality reduction. Performing such a reduction can be very useful when one wants to
find out which components are the most important (i.e., explain most of the variation).
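As a sketch of how this can be carried out in R (simulated data, illustrative object names), prcomp returns both the coefficient matrix A and the transformed matrix Y, and a rank-m approximation can be formed as described above:

set.seed(2)
n <- 100; k <- 5
X <- matrix(rnorm(n * k), nrow = n, ncol = k)
pca <- prcomp(X)                 # centers the columns by default
A <- pca$rotation                # k x k coefficient matrix (loadings)
Y <- pca$x                       # n x k matrix of principal components
m <- 2                           # keep the first m components
Y.tilde <- Y
Y.tilde[, (m + 1):k] <- 0        # drop the remaining components (their means are 0)
X.tilde <- Y.tilde %*% t(A)      # approximate, centered version of X
X.approx <- sweep(X.tilde, 2, colMeans(X), "+")  # add the column means back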
Theoretical variant
The whole data transformation above is determined by the empirical covariance matrix of the
original data matrix X. Thus, in the limit of an infinite number of observations n, the theoretical
covariance matrix determines the transformation. If one wishes to study properties of principal
component methods in large samples, this could be utilized.
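A minimal sketch of this idea (with an assumed, purely illustrative covariance matrix): the principal component axes are the eigenvectors of the covariance matrix, so a theoretical Σ can be analysed directly with eigen:

Sigma <- matrix(c(4, 1, 1, 2), nrow = 2)   # an assumed theoretical covariance matrix
e <- eigen(Sigma)
e$vectors                                  # theoretical principal component axes
e$values                                   # theoretical variances along the axes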
Notation and plots
There is some general terminology related to principal component analysis. The A_{ij} coefficients
are called loadings, and they are important to study because their sizes inform us about which
variables are caught in which new coordinates. Their signs also help in interpreting likely
explanations for extreme observations on the different principal components. The columns of
A are usually referred to as principal component axes. It is very common to plot the first two
column vectors of the transformed data Y in a scatterplot. Often, one uses so-called Trellis
plots, systematic two-dimensional plots for illustrating pairwise combinations of the first
3-4 principal components (see, e.g., Ihaka (2007)). In such plots the empirical means are often
subtracted.
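As a small illustration (simulated data, illustrative names), the loadings and the pairwise plots of the first few components can be inspected as follows; the splom function used in the tasks below gives the Trellis variant of the last plot:

X <- matrix(rnorm(100 * 8), nrow = 100, ncol = 8)   # a generic numeric data matrix
pca <- prcomp(X, scale. = TRUE)
round(pca$rotation[, 1:3], 2)                       # loadings of the first three components
pairs(pca$x[, 1:3])                                 # pairwise plots of the first three components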
A comment on PCA and regression
In regression modelling with very high-dimensional covariate data and, say, a real-valued response,
it is common to use data reduction and apply models where the original covariates are
substituted with a set of the largest principal components. This is done in order to avoid
overfitting and improve prediction capacity. Whether this is a good idea or not depends partly
on the covariance structure and partly on the regression modelling context. For example, if
you think of the problem of “finding a needle in a haystack” in the form of a single important
explanatory variable, it might be a bad idea to blur the picture by twisting coordinates.
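A minimal sketch of this kind of principal component regression, on simulated data with illustrative names (not a prescription for how it should always be done):

set.seed(3)
n <- 100; k <- 20
X <- matrix(rnorm(n * k), nrow = n, ncol = k)
y <- X[, 1] - X[, 2] + rnorm(n)              # response depends on two covariates
pca <- prcomp(X, scale. = TRUE)
m <- 3                                       # keep the m largest principal components
pc.data <- data.frame(y = y, pca$x[, 1:m])   # columns PC1, PC2, PC3
fit <- lm(y ~ ., data = pc.data)             # regress the response on the components
summary(fit)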
An alternative that is commonly used in some sciences is to compromise between the internal
variability pattern of the x-variables and the explanation of the response: among all possible
orthogonal twists of the coordinate system of the x-variables, one maximizes the covariance
between the first component and the response y. In a second step, the components of the twisted
explanatory variables are used in linear regression modelling. This can be generalized to
multivariate response variables Y, and the whole area is called PLS methods (partial least squares
regression). These compromises lead to better predictions when generalized to high dimensions,
compared to the traditional method, canonical correlation analysis, which optimizes correlation
instead of covariance.
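A small sketch of a PLS fit is given below; it assumes that the add-on package pls is installed, which is not otherwise used in this lab, and the simulated data and object names are illustrative:

library(pls)
set.seed(4)
n <- 100; k <- 20
X <- matrix(rnorm(n * k), nrow = n, ncol = k)
y <- rowSums(X[, 1:3]) + rnorm(n)
d <- data.frame(y = y, X = I(X))            # keep X as a single matrix column
fit <- plsr(y ~ X, ncomp = 3, data = d)     # components maximize covariance with the response
summary(fit)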
Let us also remark here that, in recent years, there has been an explosion of new methods that
try to find a low number of explanatory regression variables among thousands of potential ones;
these have become extremely popular in genomic screening assays. Typically, these regressions
penalize complex explanations in one way or another. One may start learning about the area by
reading about LASSO methods. If PLS is used on correlation matrices, it looks on the surface as
if it does the opposite of what LASSO aims at, namely, finding whole groups of correlated
variables that together explain the response. However, observe that one can think of advantages
also in such approaches. For example, LASSO methods might miss alternative explanations that
might be more natural to find from patterns of loadings in PLS components.
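As a pointer only, a LASSO fit could look roughly as follows; this sketch assumes the add-on package glmnet, which is not part of the lab, and uses simulated data with illustrative names:

library(glmnet)
set.seed(5)
n <- 100; k <- 500
X <- matrix(rnorm(n * k), nrow = n, ncol = k)
y <- X[, 1] + 0.5 * X[, 2] + rnorm(n)
cv <- cv.glmnet(X, y, alpha = 1)        # alpha = 1 gives the LASSO penalty
coef(cv, s = "lambda.min")              # sparse set of selected coefficients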
To formulate it drastically, the power of screening methods in genomics is extremely dependent
on a deeper biological understanding of the generic biological function possibilities of the basic
measurement components, an understanding that statisticians rarely have. Thus, research in
multidisciplinary teams is necessary not only in the application steps, but also when deciding on
which methods to use and for the development of appropriate statistical methods.
Discriminant analysis
Consider the situation in which we have two continuously distributed multidimensional random
variables with densities f1 and f2 . Imagine that you have an experiment in which you first toss
a biased coin to choose one of the two distributions with probabilities p1 and p2 = 1 − p1 . Then,
you draw an observation x from that distribution. You are then asked if you can guess if the
observation comes from the first or from the second distribution. It then turns out that one may
use a variant of Bayes’ formula to get the a posteriori odds,
P(First dist | x) / P(Second dist | x) = p1 f1(x) / (p2 f2(x)).
Suppose now, more specifically, that you have two multivariate normally distributed random
variables with the same known covariance matrix Σ and two different mean vectors, µ1 and µ2.
It then turns out that the logarithm of the odds is a linear function,

ln[ P(First dist | x) / P(Second dist | x) ] = α + β1 x1 + β2 x2 + ... + βk xk.
This is the typical expression we find in logistic regression models. The coefficients in this
linear expression are determined by the two mean vectors, the common covariance matrix and
the probability p1. The rational decision rule (for an optimal guess of distribution) can be derived
as a comparison of the log-odds with a suitable threshold c: you simply decide that the
observation came from distribution 1 if and only if the log-odds for the observation x is larger
than the threshold c.
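As a small numerical sketch (with assumed, illustrative parameter values), the coefficients and the decision rule can be written down directly when the parameters are known:

p1 <- 0.5; p2 <- 0.5                                   # assumed a priori probabilities
mu1 <- c(-1, 0); mu2 <- c(1, 0)                        # assumed mean vectors
Sigma <- diag(2)                                       # assumed common covariance matrix
beta <- solve(Sigma, mu1 - mu2)                        # slope coefficients
alpha <- log(p1 / p2) - 0.5 * sum((mu1 + mu2) * beta)  # intercept
x <- c(0.3, -0.2)                                      # a new observation
logodds <- alpha + sum(beta * x)
logodds > 0                                            # TRUE: guess distribution 1 (threshold c = 0)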
Now, suppose that you do not know the means nor the covariance matrix, but that you have
two samples, one from each of the two distributions. Then one can simply estimate the unknown
parameters with empirical estimates and plug them in as if they were true, and out comes an
estimate of the logistic regression function above. Or, if both samples are large, one may estimate
the β-coefficients by adopting a logistic regression GLM procedure as if the group memberships
were independent. In this case a separate argument has to be made for the α parameter, since it is
essentially determined by the ratio of the sample sizes of the two training data sets used, which
need not equal the a priori odds p1/p2.
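A sketch of the plug-in idea on simulated data (illustrative names; the MASS package, used later in this lab, provides mvrnorm and lda):

library(MASS)
set.seed(6)
x1 <- mvrnorm(60, mu = c(-1, 0), Sigma = diag(2))       # sample from distribution 1
x2 <- mvrnorm(40, mu = c(1, 0), Sigma = diag(2))        # sample from distribution 2
mu1.hat <- colMeans(x1); mu2.hat <- colMeans(x2)
S.pooled <- ((nrow(x1) - 1) * cov(x1) + (nrow(x2) - 1) * cov(x2)) /
  (nrow(x1) + nrow(x2) - 2)
beta.hat <- solve(S.pooled, mu1.hat - mu2.hat)          # estimated slope coefficients
group <- factor(c(rep(1, nrow(x1)), rep(2, nrow(x2))))
lda.fit <- lda(rbind(x1, x2), grouping = group)
lda.fit$scaling                                         # proportional to beta.hat (up to sign and scale)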
Discriminant analysis is an example of a supervised learning procedure. Such procedures are
utilized with increasing frequency in all kinds of practical applications in machine learning and
data mining contexts. The techniques used in this area are often extensions, with more flexibility,
of ideas from regression and multivariate statistics (neural networks, support vector machines
and regression tree methods are just a few examples). However, the most natural extensions
of discriminant analysis concern discrimination between more than two groups, say r groups
with different mean vectors and a common covariance matrix or extensions to models where the
multivariate covariance matrices are allowed to differ between the groups. The first generalization
is quite straightforward, while the second leads to more complex log-odds functions with second
order interaction terms and quite strange decision patterns for rare outlier observations. A
classical survey of discriminant analysis that discusses these questions is (Gnanadesikan, 1989).
Tasks
Multivariate plots
In this part of the lab, we will use the data set golub to show how the R command heatmap
may be used to find and illustrate patterns in data. The data set is associated with a study on
cancer classification by gene expression analysis done by Golub et al. (1999). The data set is
part of the package multtest and must be loaded via Bioconductor. Run the following code in
your R session to do this:
source("http://bioconductor.org/biocLite.R")
biocLite("multtest")
library("multtest")
data(golub)
You should now have access to a matrix golub, which contains gene expression levels for 38
tumor mRNA samples. Rows correspond to genes (3051 genes) and columns to mRNA samples.
You can run help(golub) to find out more about the data set. Note in particular that there are
two types of tumors: 27 acute lymphoblastic leukemia (ALL) cases and 11 acute myeloid leukemia
(AML) cases, where the ALL cases correspond to the first 27 columns of the matrix and the AML
cases to the rest.
1. First, plot the data as it is given, without any attempt to impose a cluster structure for
the rows or columns:
colRamp <- colorRampPalette(c("green", "black", "red"), space="rgb")(64)
heatmap(golub, col = colRamp, Rowv = NA, Colv = NA)
Looking at the plot, there should be no apparent patterns, even though we know that the
last eleven columns (28-38) correspond to AML cases. Note the interpretation of the color
palette introduced for the gene expression levels: green = low, black = medium, red =
high.
2. Next, redo the plot using the default behaviour of heatmap, which is to create clusters of
rows and columns using a Euclidean distance function and then reorder them using the
mean values of the rows and columns.
heatmap(golub, col = colRamp)
You should now be able to discern some structure in the gene data. Although the grouping
is not perfect, in the sense that it does not completely separate the ALL and AML cases, you
should be able to find two major AML groups by looking at the column tree structure. One of
them contains 5 cases, and the other 6 cases.
3. Golub et al. (1999) managed to find a set of genes strongly correlated with the class
distinction (ALL vs. AML). Download and source the file genepred.R. This will give you
access to a vector pred.genes which contains the row indices of 25 such predictive genes.
Make a heat plot using only this subset by running
heatmap(golub[pred.genes,], col = colRamp)
What do you think about the results? Is there a clear cluster which contains all 11 AML
cases now?
4. The function heatmap is quite flexible. For example, it is possible to change the distance
measure used for the rows and columns (option distfun) and the function used to compute
the hierarchical clustering (option hclustfun). Read the help page for heatmap and try
to change one or both of these options (one possible variation is sketched after this list).
Then, make a custom heat plot using all or only the predictive genes with the new options.
Can you find a set of options that improves the clustering, in the sense that all (or almost
all) AML cases are assigned to a clearly identifiable group?
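For example, one possible (by no means unique) combination uses a correlation-based distance and Ward's clustering method; colRamp and pred.genes are the objects defined above:

corDist <- function(x) as.dist(1 - cor(t(x)))           # distance between rows of the matrix
wardClust <- function(d) hclust(d, method = "ward.D2")  # Ward's agglomerative method
heatmap(golub[pred.genes,], col = colRamp,
        distfun = corDist, hclustfun = wardClust)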
Principal component analysis
Explorative uses to find patterns
Principal components are often used in order to explore potential differences between groups of
data. As an example, we may think of a two-treatment situation in which systematic effects are
added to some of the variables in one of the treatment samples. Such changes should typically
result in correlations between the X-components that are affected, and at least if they are large,
these components have a good chance of getting high negative or positive loadings in the first
principal components. If one plots each group with a different symbol in the scatterplots, or in
the Trellis variants of those, one may often find that the observations in the groups occur in
separate clusters. However, the PCA procedure is not scale invariant, and it is often recommended
that all the original X-components (the columns) are standardized before the analysis. This means
that the analysis is done on the empirical correlation matrix of the X-variables. It then turns
out that even interesting and clear clustering patterns that only pertain to one of the X-variables
are hard to find in the plots, while treatment effects that affect several of the X-variables still
show up.
The question of whether to do a PCA with or without scaling is very tricky. In the case
when PCA is used in an exploratory situation and the variables are not obviously of different
scale (like human weight in grams and length in meters), one can try PCA both with and
without standardization. Further, one could also try some variance stabilization transformation
like logarithms or square-roots for positive variables. In order to illustrate how this is done in
practice, we want you to do the following:
Simulation illustrations
1. Simulate a data matrix X with standard normal entries, having n = 100 rows (number of
multi-variate observations) and k = 8 columns (number of coordinates of each observation)
by running the code
n <- 100
k <- 8
X <- matrix(data = rnorm(n * k, mean = 0, sd = 1), nrow = n, ncol = k)
In this situation, a principal component analysis will not yield any useful information
because of the total symmetry and independence of the simulated numbers. Simulate X a
couple of times and apply the R function prcomp to it. Use the functions summary and
plot on the output of prcomp and interpret the results.
2. The default behaviour of prcomp is to use no scaling of the variables. This is the option
scale = FALSE and corresponds to the use of the empirical variance. Redo the previous
task using the empirical correlation instead by changing the option to scale = TRUE when
calling prcomp. Can you find any change in the results? Is such a change even expected
when we have independent standard normal data?
3. Imagine now that the first 50 and the last 50 observation vectors (rows) in X belong to
different treatment groups, and that an effect of 5 should be added to the second component
for all 50 observations in the second group. Make this change to the simulated data and
verify using summary(prcomp(X)) that a large proportion of the variance is now captured by
the first principal component (PC1). Next, plot the results for the two groups using a Trellis
plot showing the three first principal components against each other. As a preparatory step,
you first need to load the lattice package. The plot can then be constructed using the
command splom (scatter plot matrix):
library(lattice)
Y <- prcomp(X)$x # Create matrix holding principal components as its columns
splom(Y[,1:3], col = c(rep("blue", 50), rep("red", 50)))
Finally, redefine Y using the option scale = TRUE and make a new plot. Why are the
results different?
4. Next, we use another systematic pattern where three columns of X, the first, second and
third, are affected by an addition of 5/√3, −5/√3 and 5/√3, respectively (as before, the effect
is only added to the last 50 rows). Observe that this can be viewed as an effect size similar
to the one in the previous step. Again, perform PCAs both with and without scaling and
construct plots for both cases. Why are the results different now (compared to the previous
step)?
5. Now, let us make a completely different transformation of the data matrix. Construct a
new data matrix from X by using the successive cumulative sums of its 8 columns (first
column unchanged, second column redefined as the sum of the two first, third column
redefined as the sum of the three first, etc.); one way of constructing this matrix is sketched
after this list. This of course results in a covariance matrix where we no longer have
independence, and in which the last components of the new data matrix have much
larger variances. Repeat all the steps above for the new data matrix and comment. In the
last step, with the addition of the three effects, normalize the effect sizes so that you use
5/√3, 5/√(3/2) and 5 for components 1, 2 and 3, respectively. Comment on your results.
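One possible way to construct the new matrix (the object name X.cum is illustrative):

X.cum <- t(apply(X, 1, cumsum))    # row-wise cumulative sums over the 8 columns of X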
It should be noted that PCA is not primarily designed for the situation considered in the
simulation tasks above. There, we know which object belongs to which group. PCA is instead
mainly designed to make data reductions, and to do unsupervised searches for groups with
different patterns in multivariate data. Observe that if we ignore the group membership in the
scatter plots it is much harder to detect any patterns. It is an interesting experience to combine
the data reduction techniques of PCA with heat plots, but we do not really have time for that
here.
Discriminant analysis
In this part the practical application of discriminant analysis (DA) will be illustrated using the
R function lda. This function is part of the package MASS, and hence the first thing you should
do as a preparatory step is to run
library(MASS)
Typically, the DA procedure consists of two steps. Firstly, the function lda is applied to a data
set of an appropriate format. This is the training (or learning) part, where the parameters of
the procedure are adapted so as to be able to discriminate between observations of data points
from a finite number of different classes. Secondly, the output of lda (an R object) is passed as
an argument to the function predict, together with a new data set of observations. The goal
in the second step is to use the model fitted to the training data to predict the correct class
membership of the new observations. As a starting point for the tasks below, the following three
functions are provided. These are also found in the file “da functions.R”, available on the course
home page. You may find it convenient to take this file as a starting point for the code you will
be asked to implement.
generate.lda.data <- function(n, p, mu, Sigma) {
  ## Draw random samples from a finite number of classes
  k <- length(p)
  class <- as.factor(sample(1:k, size = n, replace = TRUE, prob = p))
  ## Simulate multivariate normal observations conditionally on class membership
  d <- length(mu[[1]]) ## Dimension of observations
  X <- matrix(data = NA, nrow = n, ncol = d)
  for (i in 1:n)
    X[i,] <- mvrnorm(1, mu = mu[[class[i]]], Sigma = Sigma[[class[i]]])
  data.frame(X, class = class)
}

plot.lda.data <- function(train.data, predict.data, predict.class, cols) {
  ## Plot training data, and then add prediction data
  plot(train.data$X1, train.data$X2, pch = 1, col = cols[train.data$class])
  points(predict.data$X1, predict.data$X2, pch = 4, col = cols[predict.class])
}

test.lda <- function() {
  n <- 100
  p <- c(0.5, 0.5)
  mu1 <- c(-1, 0); mu2 <- c(1, 0); mu <- list(mu1, mu2)
  s1 <- 1; s2 <- 1; rho <- 0
  sigma <- matrix(c(s1^2, rho * s1 * s2, rho * s1 * s2, s2^2), nrow = 2)
  Sigma <- list(sigma, sigma)
  train.data <- generate.lda.data(n, p, mu, Sigma)
  predict.data <- generate.lda.data(n, p, mu, Sigma)
  lda.out <- lda(train.data[,1:2], grouping = train.data$class)
  predict.class <- predict(lda.out, predict.data[,1:2])$class
  plot.lda.data(train.data, predict.data, predict.class, cols = c("red", "blue"))
}
The function generate.lda.data samples n multivariate observations, each from one of length(p)
possible classes. The probability that an individual observation is sampled from class i is given
by p[i]. Each class consists of multivariate normal random variables. The mean vector and
covariance matrix defining the distribution for class i are specified by the i:th entries (mu[[i]]
and Sigma[[i]]) of the lists mu and Sigma. The dimension of each observation equals the
length of each mean vector mu[[i]]. The object returned by this function is a data frame
where each row corresponds to one observation and where the last column gives the index of
the group that the observation belongs to. Note that this function is used to generate both the
training data and the prediction data.
The function plot.lda.data takes both the training data and the prediction data as its first
two arguments. It also takes the classes predicted by the predict function as its third argument,
and finally a vector of colors that maps to the different classes. The result of running this function
is a plot which shows both the training data (circles) and the prediction data (crosses). Note
that the color of the training data is the one obtained during generation, while the color for the
predicted data is the one predicted by the DA algorithm (which need not equal the true class
membership as determined when generating the prediction data using generate.lda.data).
The function test.lda shows how to specify the necessary arguments and then call the
previous two functions in order to plot the results of the analysis. As given above, it may be
interpreted as an experiment in which a fair coin is tossed 100 times. For each coin toss, if the
result is tails, then a single random observation is taken from a normal bivariate distribution
with mean µ1 = (−1, 0). If the result is heads, then a single random observation is taken from a
normal bivariate distribution with mean µ2 = (1, 0). These 100 observations then constitute the
training data, and lda is applied to it in order to fit the discrimination model. Then, another
100 tosses are made in exactly the same way in order to generate the prediction data.
1. Take a few minutes and try out what happens when you change the parameters in the
function test.lda. What happens if you change mu1 and mu2? What happens if you change
the correlation coefficient rho for the covariance matrix? Also, change the probabilities of
the two classes and verify that the distribution of observations changes accordingly.
2. Since the plot produced by plot.lda.data only shows the predicted class (crosses) for
the data used for prediction, and not the true class, it is not possible to see if any of the
observations has been misclassified by the DA procedure. In order to remedy this, modify
plot.lda.data so that the true class of the predicted data is also shown using a box with
the correct color around the crosses (hint: use pch = 0 for boxes). To do this, it should
be enough to add a single line to the code at the end of the function. Then, use your new
function to estimate the number of observations in predict.data that are misclassified.
In what follows, use your new plotting function since it shows a superset of the information
shown in the more basic version.
3. Suppose now that the class probabilities for the data to be classified are known to be
very different from the ones used for the training data. Assume for example that the
probabilities are (p1 , p2 ) = (0.9, 0.1). Then it is still quite likely that a data point which
is close to the second cluster will belong to the first. However, this aspect is not captured
by the DA prediction unless these new probabilities are passed on as a parameter to the
function predict. If this is not done, the empirical probabilities of the training data will
be used also for prediction. Use ?predict.lda to find out how the call to predict must
be modified in order to change the probabilities when performing prediction. Now, keeping
the generation of the training data exactly as before, use (p1, p2) = (0.9, 0.1) during generation
of the prediction data. Then, compare the performance of the predictions (by counting
misclassifications) with and without passing these probabilities to the predict function.
Can you see any improvement?
4. So far, we’ve only used two classes. However, the functions introduced for generating and
plotting data support using more groups of bivariate normal distributions. Extend the
test.lda function with more groups (say a total of 3 or 4) and try out some different mean
vectors and covariance matrices. Do some plots to get a feeling for how the classifications
(and misclassifications) depend on the mean vectors and covariance matrices of the groups.
You may also want to change the allocation (group) probabilities and the total number
of samples taken. Finally, extend the test.lda function so that it also computes the
proportion of misclassified observations among the ones that are predicted.
5. In the final subtask, you will try out the lda function on a real data set. The data set
used is called “iris”. Load it by calling data(iris) and read about it using ?iris. The
main change now is that each observation consists of four variables instead of just two.
It is therefore more challenging to graphically present the results, but you could try to
use what you’ve learned so far in the course and think about some appropriate way to
do it. However, the main task is to check how well DA discriminates by computing the
proportion of misclassified observations. Do this by using the first 25 observations in each
class of species (setosa, versicolor and virginica) as training data, and the remaining 25
observations in each group as prediction data (it may be a good idea to first create two
separate data frames from iris, one holding the training data and the other holding the
prediction data).
References
Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and
display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95(25):14863–14868.
Gnanadesikan, R. (1989). Discriminant analysis and clustering: Panel on discriminant analysis,
classification, and clustering. Statistical Science, 4(1):34–69.
http://projecteuclid.org/download/pdf_1/euclid.ss/1177012666.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H.,
Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D., and Lander, E. S. (1999). Molecular
classification of cancer: Class discovery and class prediction by gene expression monitoring.
Science, New Series, 286(5439):531–537.
Ihaka, R. (2007). Trellis plots. https://www.stat.auckland.ac.nz/~ihaka/787/lectures-trellis.pdf.
Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: A review. ACM Computing
Surveys, 31(3):265–323.
Wilkinson, L. and Friendly, M. (2009). The history of the cluster heat map. The American
Statistician, 63(2):179–184.