Bioinformatics II Spring 2006
Homework 5
Due Nov. 2, 2006
email to: [email protected]

Hand in the R code and output. All plots should have titles (e.g. main="Array 1 Expression").

For this homework, we will try out some dimension reduction and clustering. The data are in yeast_cycle.txt: 679 genes from the yeast cell cycle data, selected by Spellman et al. to show cyclic behavior. The values stored in the file are normalized M values taken every 20 minutes. I do not have the details of the normalization.

1. Use the singular value decomposition on the data matrix. Obtain a scree plot (of the singular values d) and plot the 12 eigengenes (columns of v) against time. Which eigengenes demonstrate the cyclic nature of the data? About how many dimensions are needed to describe the data?

2. Convert the data to zscores (for each gene) and redo question 1. For the remainder of the homework, use the zscores. Euclidean distance on zscores is equivalent to correlation distance: the squared Euclidean distance between two zscored genes is proportional to 1 minus their correlation.

3. Use complete linkage clustering to cluster the genes.
a) From the height plot, are there any obvious places to cut the dendrogram to form clusters?
b) Obtain silhouette plots for 3, 4, 5 and 6 clusters. Is there evidence that one of these numbers of clusters is better than another?
c) For the clustering with 4 clusters, plot the cluster means against time using different colors on a single plot. Is there a pattern?
d) Plot the first 2 columns of the u-matrix from the svd of the zscores, using a different color for each cluster.
e) If we wanted to add a gene with expression levels Y1 ... Y12 to this plot, we form the row matrix Y = (Y1 ... Y12). The row of u corresponding to Y is Y v diag(1/d), where diag(1/d) is a diagonal matrix with 1/d on the diagonal. Now consider a gene that expresses with zscore=6 at time i, and is 0 at all other times. For each i, add this gene to your plot. It will be easier to see what is happening if you use plot character "1" for i=1, plot character "2" for i=2, etc.
f) If you did e correctly, the genes that express at any given time form a spiral going toward the center of the plot. What does this suggest about the interpretation of the clusters?

4. Use k-means clustering to cluster the genes for k=3,4,5,6.
a) Is there evidence that one of these numbers of clusters is better than another?
c) For the clustering with 4 clusters, plot the cluster means against time using different colors on a single plot. Is there a pattern?
d) Plot the first 2 columns of the u-matrix from the svd of the zscores, using a different color for each cluster. Compare with 3d (complete linkage).

5. Denoising
Cluster the data into 4 clusters using the first 4 eigenvectors (columns of u) as the data, using whatever clustering method you like. Plot the cluster means (of the zscores) against time using different colors on a single plot. Plot the first 2 columns of the u-matrix from the svd of the zscores, using a different color for each cluster. Compare with 3c,d (complete linkage) and 4c,d (k-means).

6. Assessing stability (You will need to write some code for this!)
Try 4 and 9 clusters, each on the zscores and on the first 4 eigenvectors. This gives 4 different clusterings using the clustering method of your choice. Now, generate 10 samples of zscores and repeat all 4 clusterings with each sample. For each of the 4 clusterings, how many genes are entirely stable, i.e. stay in the same cluster each time? How many genes do not wind up in the same cluster at least 6 times?

You will need to find a way to "match" the clusters in the different samples. You could use the profile of the cluster mean, or you could use cluster membership.

You will need to find a way to generate samples, starting from the zscores. Here are 2 simple ways:
i) Take a random subsample of the genes. (How will you handle the fact that in each sample some genes are missing?)
ii) Each zscore has mean 0 and std 1. Add random noise (perhaps normal or t with some number of d.f.). Since you do not want to swamp the pattern, multiply the noise by something small (e.g. .1 or .01) and add it to the observed value. You will then need to recompute the zscores for each gene before processing. (How much noise is appropriate?)

7. Download Eisen's Cluster from http://rana.lbl.gov/EisenSoftware.htm. Use it to visualize your complete linkage and k-means clusterings. Save one of the visualizations as a PS file, and send it to me by e-mail.

P.S. Don't forget that you do NOT want to cluster the arrays; cluster the genes.
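
A hint for the projection in the complete linkage question, part e: the formula Y v diag(1/d) can be checked on made-up data before it is used on the yeast genes. The sketch below is only an illustration under stated assumptions. The matrix z is random noise standing in for the zscores, and the names project, k, and spike are hypothetical, not part of the assignment. Because each zscored row has mean 0, the last singular value is numerically zero, so the sketch divides only by the first k values of d.

```r
# Synthetic stand-in for the zscore matrix (NOT the yeast data):
# 50 hypothetical genes by 12 time points, each row scaled to mean 0, sd 1.
set.seed(1)
x <- matrix(rnorm(50 * 12), nrow = 50, ncol = 12)
z <- t(scale(t(x)))

# svd() returns the singular values d and the matrices u and v.
s <- svd(z)
k <- 2  # number of leading components to keep (assumed: the 2 plotted columns)

# Project a 1 x 12 row y onto the first k columns of u:
# y %*% v[, 1:k] %*% diag(1 / d[1:k])
project <- function(y, s, k) {
  (y %*% s$v[, 1:k, drop = FALSE]) %*% diag(1 / s$d[1:k], nrow = k)
}

# Check: projecting an existing row of z should recover the matching row of u.
u_hat <- project(z[7, , drop = FALSE], s, k)
max_err <- max(abs(u_hat - s$u[7, 1:k]))
print(max_err)

# A hypothetical "spike" gene: zscore 6 at time 3, 0 at all other times.
spike <- matrix(0, nrow = 1, ncol = 12)
spike[1, 3] <- 6
spike_uv <- project(spike, s, k)
print(spike_uv)
```

If the check prints a value near machine precision, the same project() call can be applied to any new 1 x 12 row, and the result can be added to an existing plot with points() or text().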
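
The statement in question 2 relating Euclidean distance on zscores to correlation distance can be checked numerically. The sketch below uses two made-up genes (g1 and g2 are hypothetical, not taken from yeast_cycle.txt) and assumes zscores computed with R's sd(), which uses the n - 1 denominator. Under that assumption the squared Euclidean distance equals 2 (n - 1) (1 - r), where r is the correlation.

```r
# Two synthetic genes measured at 12 time points (NOT the yeast data).
set.seed(2)
n <- 12
g1 <- rnorm(n)
g2 <- 0.5 * g1 + rnorm(n)

# zscore each gene: mean 0, sd 1 (sd() divides by n - 1).
z1 <- (g1 - mean(g1)) / sd(g1)
z2 <- (g2 - mean(g2)) / sd(g2)

# Squared Euclidean distance between the two zscored genes ...
d2 <- sum((z1 - z2)^2)

# ... compared with 2 * (n - 1) * (1 - correlation).
rhs <- 2 * (n - 1) * (1 - cor(g1, g2))

print(c(d2, rhs))
```

The two printed numbers should agree up to rounding, which is why clustering the zscores with Euclidean distance orders gene pairs the same way as correlation distance.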