Bioinformatics II
Spring 2006
Homework 5
Due Nov. 2, 2006
email to: [email protected]
Hand in the R code and output. All plots should have titles
(e.g. main="Array 1 Expression").
For this homework, we will try out some dimension reduction and clustering. The
data are in yeast_cycle.txt: 679 genes from the yeast cell cycle data, selected by
Spellman et al. to show cyclic behavior. The values stored in the file are
normalized M values taken every 20 minutes. I do not have the details of the
normalization.
1. Use the singular value decomposition on the data matrix. Obtain a scree plot of
the singular values (d) and then plot the 12 eigengenes (columns of v) against time.
Which eigengenes demonstrate the cyclic nature of the data?
About how many dimensions are needed to describe the data?
2. Convert the data to zscores and redo question 1.
For the remainder of the homework, use the zscores: squared Euclidean distance on
zscores is proportional to correlation distance (1 - r), so the two rank pairs of
genes in the same order.
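The relation between Euclidean distance on zscores and correlation can be checked numerically. The following is a minimal sketch in Python (the homework itself is to be done in R), assuming that zscores are computed per gene with the n-1 standard deviation. The two profiles are made-up values, not taken from yeast_cycle.txt, and the helper names are illustrative only. It checks that the squared Euclidean distance between two zscored profiles equals 2(n-1)(1-r), where r is their Pearson correlation.

```python
import math

def zscore(row):
    """Center a profile to mean 0 and scale it to sample standard deviation 1."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in row) / (n - 1))
    return [(x - mean) / sd for x in row]

def pearson(a, b):
    """Pearson correlation of two equal-length profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def sq_euclidean(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two made-up expression profiles (illustrative values only).
g1 = [0.2, 1.1, 0.9, -0.3, -1.0, -0.4, 0.5, 1.2]
g2 = [1.0, 0.4, -0.2, -0.9, -0.5, 0.3, 0.8, 0.1]

n = len(g1)
lhs = sq_euclidean(zscore(g1), zscore(g2))   # distance on zscores
rhs = 2 * (n - 1) * (1 - pearson(g1, g2))    # scaled correlation distance
print(lhs, rhs)  # the two numbers agree up to rounding error
```

Because the distance is a monotone function of 1 - r, any method that depends only on the ordering of distances (such as complete linkage) gives the same tree under either measure.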
3. Use complete linkage clustering to cluster the genes.
a) From the height plot, are there any obvious places to cut the dendrogram to form
clusters?
b) Obtain silhouette plots for 3, 4, 5 and 6 clusters. Is there evidence that one of
these numbers of clusters is better than another?
c) For the clustering with 4 clusters, plot the cluster mean against time using
different colors on a single plot. Is there a pattern?
d) Plot the first 2 columns of the u-matrix from the svd of the zscores, using a
different color for each cluster.
e) If we wanted to add a gene with expression levels Y1 ... Y12 to this plot, we form
the row matrix Y = (Y1 ... Y12). The row of u corresponding to Y is Y v diag(1/d), where
diag(1/d) is a diagonal matrix with the reciprocals of the singular values on the diagonal.
Now consider a gene that expresses with zscore=6 at time i, and is 0 at all other
times. For each i, add this gene to your plot. It will be easier to see what is
happening if you use plot character "1" for i=1, plot character "2" for i=2 etc.
f) If you did e correctly, the genes that express at any given time form a spiral going
toward the center of the plot. What does this suggest about the interpretation of the
clusters?
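The projection formula in part e follows directly from the definition of the SVD. The derivation below is a short sketch, writing X for the matrix of zscores, U and V for the matrices that R's svd() returns as u and v, and e_i for the i-th unit vector (notation introduced here for illustration). Only the first two coordinates are needed for the plot, so only d_1 and d_2 have to be nonzero.

```latex
\begin{align*}
X &= U \,\mathrm{diag}(d)\, V^{\top}, \qquad V^{\top} V = I \\
X V &= U \,\mathrm{diag}(d) \\
U &= X V \,\mathrm{diag}(1/d) \\
\intertext{so a new row $Y = (Y_1, \dots, Y_{12})$ is placed at}
u_Y &= Y V \,\mathrm{diag}(1/d). \\
\intertext{For a gene with zscore 6 at time $i$ and 0 elsewhere, $Y = 6\, e_i^{\top}$, and the first two coordinates are}
(u_{Y,1},\, u_{Y,2}) &= \left( \frac{6\, v_{i1}}{d_1},\; \frac{6\, v_{i2}}{d_2} \right).
\end{align*}
```

In other words, each such gene is plotted at a rescaled copy of row i of v.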
3. Use k-means clustering to cluster the genes for k=3,4,5,6.
a) Is there evidence that one of these numbers of clusters is better than another?
c) For the clustering with 4 clusters, plot the cluster means against time using
different colors on a single plot. Is there a pattern?
d) Plot the first 2 columns of the u-matrix from the svd of the zscores, using a
different color for each cluster.
Compare with part d of the complete linkage question.
4. Denoising
Cluster the data into 4 clusters using the first 4 eigenvectors (columns of U) as the
data, using whatever clustering method you like.
Plot the cluster means (of the zscores) against time using different colors on a single
plot.
Plot the first 2 columns of the u-matrix from the svd of the zscores, using a different
color for each cluster.
Compare with parts c and d of the complete linkage and k-means questions.
5. Assessing stability (You will need to write some code for this!)
Try 4 and 9 clusters. Try the zscores and the first 4 eigenvectors. This gives 4
different clusterings using the clustering method of your choice.
Now, generate 10 samples of zscores and repeat all 4 clusterings with each sample.
For each of the 4 clusterings, how many genes are entirely stable, i.e. stay in the
same cluster every time? How many genes do not wind up in the same cluster at
least 6 times?
You will need to find a way to "match" the clusters in the different samples. You
could use the profile of the cluster mean, or you could use cluster membership.
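One way to match clusters by membership is sketched below in Python (the homework itself is to be done in R). It assumes both clusterings label the same genes in the same order with labels 0..k-1. The function names match_labels and count_stable are hypothetical. The sketch tabulates how often each pair of labels co-occurs and then tries every relabeling, keeping the one with the most agreement. This brute-force search is an illustrative choice, not the required method: it is cheap for k=4 (24 relabelings) and still feasible for k=9 (362,880 relabelings).

```python
from itertools import permutations

def match_labels(reference, other, k):
    """Relabel `other` so that it agrees with `reference` on as many genes as possible."""
    # counts[o][r] = number of genes labelled o in `other` and r in `reference`
    counts = [[0] * k for _ in range(k)]
    for r, o in zip(reference, other):
        counts[o][r] += 1
    # Try every mapping old label -> new label and keep the best-agreeing one.
    best = max(permutations(range(k)),
               key=lambda perm: sum(counts[o][perm[o]] for o in range(k)))
    return [best[o] for o in other]

def count_stable(labelings):
    """Count genes that carry the same label in every (already matched) labeling."""
    return sum(1 for labels in zip(*labelings) if len(set(labels)) == 1)

# Toy example with 6 genes and 3 clusters (made-up labels).
reference = [0, 0, 1, 1, 2, 2]
resample = [2, 2, 0, 0, 1, 0]
matched = match_labels(reference, resample, k=3)
print(matched)                             # [0, 0, 1, 1, 2, 1]
print(count_stable([reference, matched]))  # 5
```

With 10 samples, each sample would be matched to a common reference clustering before counting how often each gene keeps its label.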
You will need to find a way to generate samples, starting from the zscores. Here are
2 simple ways:
i) Take a random subsample of the genes. (How will you handle the fact that in each
sample some genes are missing?)
ii) Each gene's zscores have mean 0 and std 1. Add random noise (perhaps normal, or t
with some number of d.f.). Since you do not want to swamp the pattern, multiply the
noise by something small (e.g. 0.1 or 0.01) and add it to the observed value. You will
then need to recompute the zscores for each gene before processing. (How much
noise is appropriate?)
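Both sampling schemes can be sketched in a few lines. The Python code below (the homework itself is to be done in R) is a minimal illustration under stated assumptions: noise is normal, the scale 0.1 and subsample fraction 0.8 are arbitrary choices, the two profiles are made-up values, and the helper names noisy_sample and subsample_indices are hypothetical.

```python
import math
import random

def zscore(row):
    """Center a profile to mean 0 and scale it to sample standard deviation 1."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in row) / (n - 1))
    return [(x - mean) / sd for x in row]

def noisy_sample(z, scale, rng):
    """Scheme ii: add N(0, scale^2) noise to every entry, then re-standardize each gene."""
    return [zscore([x + scale * rng.gauss(0.0, 1.0) for x in row]) for row in z]

def subsample_indices(n_genes, fraction, rng):
    """Scheme i: choose a random subset of gene indices, kept so genes can be tracked."""
    m = int(fraction * n_genes)
    return sorted(rng.sample(range(n_genes), m))

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Two made-up zscored profiles standing in for the 679 x 12 data matrix.
z = [zscore([0.2, 1.1, 0.9, -0.3, -1.0, -0.4]),
     zscore([1.0, 0.4, -0.2, -0.9, -0.5, 0.3])]

samples = [noisy_sample(z, scale=0.1, rng=rng) for _ in range(10)]
```

Returning the selected indices in scheme i makes it possible to record, for each gene, in which samples it was present, so that stability is judged only over those samples.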
6. Download Eisen's Cluster from http://rana.lbl.gov/EisenSoftware.htm. Use it to
visualize your complete linkage and k-means clusterings. Save one of the
visualizations as a PS file, and send it to me by e-mail.
p.s. Don't forget that you do NOT want to cluster the arrays.