Exploring Data using Dimension Reduction and Clustering
Naomi Altman
Nov. 06
Spellman Cell Cycle data
Yeast cells were synchronized by arrest of a cdc15 temperature-sensitive mutant.
Samples were taken every 10 minutes and one array was hybridized for each sample using a reference design. Two complete cycles are in the data.
I downloaded the data and normalized using loess. (Print tip data were not available.)
I used the normalized value of M as the primary data.
What they did
Supervised dimension reduction = regression
They were looking for genes that have cyclic behavior, i.e., a sine or cosine wave in time.
They regressed M_i on sine and cosine waves and selected genes for which the R^2 was high.
The period of the wave was known (from observing the cells?), so they regressed against sin(wt) and cos(wt), where w is set to give the appropriate period.
If the period is unknown, a method called Fourier analysis can be used to discover it.
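As an illustration, here is a minimal sketch of this kind of fit for a single gene. The period value and the gene index are arbitrary choices, and the M.yeast and time objects are the ones created in the R code further below; this is not the authors' actual analysis.
# Regress one gene's M values on sin(wt) and cos(wt) with a "known" period
period = 115                      # hypothetical cycle length, in minutes
w = 2*pi/period                   # angular frequency
g = as.numeric(M.yeast[5,])       # one gene's normalized M values
fit = lm(g ~ sin(w*time) + cos(w*time))
summary(fit)$r.squared            # a high R^2 suggests cyclic behavior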
Regression
Suppose we are looking for genes that are associated
with a particular quantitative phenotype, or have a
pattern that is known in advance.
E.g. Suppose we are interested in genes that change
linearly with temperature and quadratically with pH.
Y = b0 + b1*Temp + b2*pH + b3*pH^2 + noise
We might fit this model for each gene (assuming that the arrays came from samples subjected to different levels of Temp and pH).
This is similar to differential expression analysis - we
have a multiple comparisons problem.
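For concreteness, here is a minimal sketch of fitting this model gene by gene; the expression matrix expr and the Temp and pH values below are made up purely for illustration.
# Per-gene regression on Temp and pH (all data simulated)
set.seed(1)
Temp = rep(c(25, 30, 37), each = 4)           # hypothetical temperatures
pH   = rep(c(5, 6, 7, 8), times = 3)          # hypothetical pH levels
expr = matrix(rnorm(100 * 12), nrow = 100)    # 100 "genes" x 12 arrays
pvals = apply(expr, 1, function(y) {
  f = summary(lm(y ~ Temp + pH + I(pH^2)))$fstatistic
  pf(f[1], f[2], f[3], lower.tail = FALSE)    # overall F-test p-value
})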
Regression
We might compute an adjusted p-value, or a goodness-of-fit statistic, to select genes based on the fit to a pattern.
If we have many "conditions" we do not need to replicate
as much as in differential expression analysis because
we consider any deviation from the "pattern" to be
random variation.
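A sketch of the selection step, using the pvals vector from the sketch above and the Benjamini-Hochberg adjustment:
padj = p.adjust(pvals, method = "BH")   # adjust for multiple comparisons
selected = which(padj < 0.05)           # genes with a strong fit to the pattern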
What I did
Unsupervised dimension reduction: I used SVD on the 832 genes x 24 time points.
We can see that eigengene 5 has the cyclic genes.
For class
I extracted the 304 spots with variance greater than 0.25.
To my surprise, several of these were empty or control spots. I removed these.
This leaves 295 genes, which are in yeast.txt.
Read these into R.
Also: time=c(10,30,50,10*(7:25),270,290)
yeast=read.delim("yeast.txt",header=T)
time=c(10,30,50,10*(7:25),270,290)
M.yeast=yeast[,2:25] #strip off the gene names
svd.m=svd(M.yeast) # svd
#scree plot
plot(1:24,svd.m$d)
par(mfrow=c(4,4))
# plot the first 16 "eigengenes"
for (i in 1:16) plot(time,svd.m$v[,i],main=paste("Eigen",i),type="l")
par(mfrow=c(1,1))
plot(time,svd.m$v[,1],type="l",ylim=c(min(svd.m$v),max(svd.m$v)))
for (i in 2:4) lines(time,svd.m$v[,i],col=i)
#It looks like "eigengenes" 2-4 have the periodic components.
# Reduce dimension by finding genes that are linear combinations
# of these 3 patterns by regression
# We can use limma to fit a regression to every gene and use e.g.
# the F or p-value to pick significant genes
library(limma)
design.reg=model.matrix(~svd.m$v[,2:4])
fit.reg=lmFit(M.yeast,design.reg)
# The "reduced dimension" version of the genes are the fitted
# values: b0 + b1*v2 + b2*v3 + b3*v4, where vi is the ith column of svd.m$v
# and bi are the coefficients
# Let's look at gene 1 (not periodic) and genes 5, 6, 7
i=1   # repeat with i=5, 6, 7
plot(time,as.numeric(M.yeast[i,]),type="l")
lines(time,fit.reg$coef[i,1]+fit.reg$coef[i,2]*svd.m$v[,2]+
  fit.reg$coef[i,3]*svd.m$v[,3]+fit.reg$coef[i,4]*svd.m$v[,4])
# Select the genes with a strong period component
# We could use R2 but in limma, it is simplest to compute the
# moderated F-test for regression and then use the p-values.
# Limma requires us to remove the intercept from the coefficients
# to get this test :(
contrast.matrix=cbind(c(0,1,0,0),c(0,0,1,0),c(0,0,0,1))
fit.contrast=contrasts.fit(fit.reg,contrast.matrix)
efit=eBayes(fit.contrast)
# We will use the Bonferroni method to pick a significance level
# a=0.05/#genes = 0.00017
sigGenes=which(efit$F.p.value<0.00017)
#plot a few of these genes
# You might also want to plot a few genes with p-value > 0.5
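One way to make those plots (a sketch, not part of the original handout), reusing sigGenes and efit from above:
par(mfrow=c(2,2))
for (i in sigGenes[1:2]) plot(time,as.numeric(M.yeast[i,]),type="l",main=paste("Gene",i,"(significant)"))
notSig=which(efit$F.p.value>0.5)
for (i in notSig[1:2]) plot(time,as.numeric(M.yeast[i,]),type="l",main=paste("Gene",i,"(p > 0.5)"))
par(mfrow=c(1,1))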
Note that we used the normalized, but uncentered and unscaled, data for this exercise.
Things might look very different if the data were transformed.
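A sketch of one such transformation (not part of the original exercise): center and scale each gene before the SVD and compare the eigengenes.
M.scaled=t(scale(t(M.yeast)))     # center/scale each gene (row) across time
svd.s=svd(M.scaled)
plot(time,svd.s$v[,1],type="l")   # compare with the untransformed eigengenes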
Clustering
We might ask which genes have similar expression patterns.
Once we have expressed (dis)similarity as a distance measure, we can use this measure to cluster genes that are similar.
There are many methods. We will discuss two: hierarchical clustering and k-means clustering.
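Two common choices of gene-gene distance, as a sketch (illustration only):
d.euc=dist(M.yeast)               # Euclidean distance between genes
d.cor=as.dist(1-cor(t(M.yeast)))  # correlation-based distance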
Hierarchical Clustering (agglomerative)
1. Choose a distance function for points d(x1,x2)
2. Choose a distance function for clusters D(C1,C2) (for clusters formed by just one
point, D reduces to d).
3. Start from N clusters, each containing one data point.
At each iteration:
a) Using the current matrix of cluster distances, find the two closest clusters.
b) Update the list of clusters by merging the two closest.
c) Update the matrix of cluster distances accordingly.
4. Repeat until all data points are joined in one cluster.
Remarks:
1. The method is sensitive to anomalous data points/outliers.
2. Mergers are irreversible: “bad” mergers occurring early on
affect the structure of the nested sequence.
3. If two pairs of clusters are equally (and maximally) close at a
given iteration, we have to choose arbitrarily; the choice will
affect the structure of the nested sequence.
F. Chiaromonte, Sp 06
Defining cluster distance: the linkage function
D(C1,C2) is a function f of the distances { d(x1i,x2j) }, for x1i in C1 and x2j in C2:
Single (string-like, long clusters): f = min
Complete (ball-like, compact clusters): f = max
Average: f = average
Centroid: D = d(ave(x1i), ave(x2j))
Single and complete linkages produce nested sequences invariant
under monotone transformations of d – not the case for average
linkage.
However, the latter is a compromise between “long”, “stringy”
clusters produced by single, and “round”, “compact” clusters
produced by complete.
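A sketch comparing linkage choices on the yeast data (illustration only):
d=dist(M.yeast)
hc.single=hclust(d,method="single")
hc.complete=hclust(d,method="complete")
hc.average=hclust(d,method="average")
par(mfrow=c(1,3))
plot(hc.single); plot(hc.complete); plot(hc.average)
par(mfrow=c(1,1))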
Example
Agglomeration step in constructing the nested sequence (first iteration):
1. 3 and 5 are the closest, and are therefore merged in cluster "35".
2. A new distance matrix is computed with complete linkage.
In the dendrogram, the ordinate gives the distance, or height, at which each merger occurred.
The horizontal ordering of the data points is any order preventing intersections of branches.
(Dendrograms shown for single linkage and complete linkage.)
Hierarchical Clustering
Hierarchical clustering, per se, does not dictate a
partition and a number of clusters.
It provides a nested sequence of partitions (this
is more informative than just one partition).
To settle on one partition, we have to “cut” the
dendrogram.
Usually we pick a height and cut there - but the
most informative cuts are often at different
heights for different branches.
hclust(dist(M.yeast), method="single")
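To settle on one partition we can cut the tree with cutree(); a sketch (the choice of k=6 is arbitrary):
hc=hclust(dist(M.yeast),method="single")
clusters=cutree(hc,k=6)
table(clusters)                   # cluster sizes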
Partitioning algorithms: K-means.
1. Choose a distance function for points d(xi,xj).
2. Choose K = number of clusters.
3. Initialize the K cluster centroids (with points chosen at random).
4. Use the data to iteratively relocate centroids, and reallocate points to the closest centroid.
At each iteration:
a) Compute distance of each data point from each current
centroid.
b) Update current cluster membership of each data point,
selecting the centroid to which the point is closest.
c) Update current centroids, as averages of the new clusters formed in (b).
5. Repeat until cluster memberships, and thus centroids, stop
changing.
Remarks:
1. This method is sensitive to anomalous data points/outliers.
2. Points can move from one cluster to another, but the final
solution depends strongly on centroid initialization (so we
usually restart several times to check).
3. If two centroids are equally (and maximally) close to an
observation at a given iteration, we have to choose arbitrarily
(the problem here is not so serious because points can move
later).
4. There are several “variants” of the k-means algorithm using e.g.
median.
5. K-means converges to a local minimum of the total within-cluster square distance (total within-cluster sum of squares), not necessarily a global one.
6. Clusters tend to be ball-shaped with respect to the chosen
distance.
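Because the solution depends on the initialization (remarks 2 and 5), it is common to restart k-means several times and keep the best run, e.g. via the nstart argument; a sketch:
k.best=kmeans(M.yeast,centers=6,nstart=25)
k.best$tot.withinss               # total within-cluster SS of the best run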
Starting from the arbitrarily chosen open rectangles:
Assign every data value to a cluster defined by the nearest
centroid.
Recompute the centroids based on the most current clustering.
Reassign data values to cluster and repeat.
Remarks:
The algorithm does not
indicate how to pick K.
To change K, redo the
partitioning. The clusters
are not necessarily
nested.
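One rough way to choose K (a sketch, not from the original slides): run k-means over a range of K and look for an "elbow" in the total within-cluster sum of squares.
wss=sapply(2:10,function(k) kmeans(M.yeast,centers=k,nstart=10)$tot.withinss)
plot(2:10,wss,type="b",xlab="K",ylab="Total within-cluster SS")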
Here is the yeast data (4 runs). To display the clusters, we often use the main eigendirections (the columns of svd.m$u).
These do show that much of the clustering is defined by these 2 directions, but it is not clear that there really are clusters.
(Plots shown for 6 clusters and for 4 clusters.)
k.out=kmeans(M.yeast,centers=6)
plot(svd.m$u[,1],svd.m$u[,2],col=k.out$cluster)
Other Partitioning Methods
1. Partitioning around medoids (PAM): instead of averages, use multidimensional medians as centroids (cluster "prototypes"). Dudoit and Fridlyand (2002).
2. Self-organizing maps (SOM): add an underlying "topology" (neighboring structure on a lattice) that relates cluster centroids to one another. Kohonen (1997), Tamayo et al. (1999).
3. Fuzzy k-means: allow for a "gradation" of points between clusters; soft partitions. Gasch and Eisen (2002).
4. Mixture-based clustering: implemented through an EM (Expectation-Maximization) algorithm. This provides soft partitioning, and allows for modeling of cluster centroids and shapes. Yeung et al. (2001), McLachlan et al. (2002).
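PAM is implemented in the cluster package; a quick sketch on the yeast data (the choice of k=6 is arbitrary):
library(cluster)
pam.out=pam(M.yeast,k=6)
plot(svd.m$u[,1],svd.m$u[,2],col=pam.out$clustering)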
Assessing the Clusters Computationally
The bottom line is that the clustering is "good" if it is biologically meaningful (but this is hard to assess).
Computationally we can:
1) Use a goodness-of-cluster measure, such as the within-cluster distances compared to the between-cluster distances.
2) Perturb the data and assess cluster changes:
a) add noise (maybe residuals after ANOVA)
b) resample (genes, arrays)
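A sketch of option 1) using average silhouette width from the cluster package, which compares within-cluster to between-cluster distances (k.out is the k-means fit from above):
library(cluster)
sil=silhouette(k.out$cluster,dist(M.yeast))
summary(sil)$avg.width            # closer to 1 means better-separated clusters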