Canadian Bioinformatics Workshops (www.bioinformatics.ca)
Essential Statistics in Biology: Getting the Numbers Right
Raphael Gottardo, Clinical Research Institute of Montreal (IRCM)
[email protected], http://www.rglab.org

Outline
• Exploratory Data Analysis
• 1-2 sample t-tests, multiple testing
• Clustering
• SVD/PCA
• Frequentists vs. Bayesians

Day 1 - Section 3: Clustering (Multivariate analysis)

Outline
• Basics of clustering
• Hierarchical clustering
• K-means
• Model-based clustering

What is it?
Clustering is the classification of similar objects into different groups: a data set is partitioned into subsets (clusters) so that the data within each subset are "close" to one another, usually according to some defined distance measure.
Examples: web documents (www), gene clustering.

Hierarchical clustering
Given N items and a distance metric:
1. Assign each item to its own cluster and initialize the between-cluster distance matrix with the distances between items.
2. Find the closest pair of clusters and merge them into a single cluster.
3. Compute the new distances between clusters.
4. Repeat steps 2-3 until all items are merged into a single cluster.

Single linkage
The distance between two clusters is defined as the shortest distance from any member of one cluster to any member of the other cluster.
[Figure: Cluster 1 and Cluster 2 joined by the shortest pairwise distance d]

Complete linkage
The distance between two clusters is defined as the greatest distance from any member of one cluster to any member of the other cluster.
[Figure: Cluster 1 and Cluster 2 joined by the greatest pairwise distance d]

Average linkage
The distance between two clusters is defined as the average distance from any member of one cluster to any member of the other cluster.
[Figure: Cluster 1 and Cluster 2; d = average of all pairwise distances]

Example
Cell cycle dataset (Cho et al. 1998): expression levels of ~6000 genes over 17 time points (2 cell cycles).

Example
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
D.cho <- dist(cho.data, method = "euclidean")
hc.single <- hclust(D.cho, method = "single", members = NULL)

[Figures: single-linkage dendrogram, cut into k = 2, 3, 4, 5 and 25 clusters]
[Figure: expression profiles of the 4 clusters obtained with single linkage, k = 4]
[Figure: complete-linkage dendrogram cut at k = 4]
[Figure: expression profiles of the 4 clusters obtained with complete linkage, k = 4]

K-means
N items, assume K clusters.
The goal is to minimize the within-cluster sum of squares
$\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2$
over the possible assignments $C_1, \dots, C_K$ and centroids $\mu_1, \dots, \mu_K$, where $\mu_k$ represents the location of cluster k.

K-means - algorithm
1. Divide the data into K clusters and initialize the centroids with the means of the clusters.
2. Assign each item to the cluster with the closest centroid.
3. When all objects have been assigned, recalculate the centroids (means).
4. Repeat steps 2-3 until the centroids no longer move.

K-means - algorithm
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
for (i in 1:4) {
  set.seed(100)
  cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = i)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
  Sys.sleep(2)
}

Example
[Figure: the four k-means fits obtained with iter.max = 1, 2, 3, 4] Why?
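The four panels presumably differ because kmeans is stopped after 1 to 4 iterations from the same random starting centroids, so only the last run is close to convergence. As a point of comparison, here is a minimal sketch, not from the original slides, that clusters the same simulated data to convergence with several random restarts; nstart and tot.withinss are standard parts of stats::kmeans, while the choice of 2 centers (rather than the 5 random centers above) is an assumption matching the two simulated groups.

# Run k-means to convergence, keeping the best of 25 random initializations
cl <- kmeans(x, centers = 2, nstart = 25)
plot(x, col = cl$cluster, main = "k-means, converged")
points(cl$centers, col = 1:2, pch = 8, cex = 2)
cl$tot.withinss  # total within-cluster sum of squares (the quantity k-means minimizes)

With nstart > 1, kmeans retains the solution with the smallest total within-cluster sum of squares, which reduces the sensitivity of the result to the initial centroids.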
Example
[Figure: expression profiles of the 4 k-means clusters on the cell cycle data]

Summary
• K-means and hierarchical clustering are useful techniques.
• Fast and easy to implement (but beware of the memory requirements of hierarchical clustering).
• A bit "ad hoc":
  • Number of clusters?
  • Distance metric?
  • What is a good clustering?

Model-based clustering
• Based on probability models (e.g. Normal mixture models)
• Lets us say what a good clustering is
• Compare several models
• Estimate the number of clusters!

Model-based clustering (Yeung et al. 2001)
Multivariate observations, K clusters.
Assume observation i belongs to cluster k; then
$y_i \sim N(\mu_k, \Sigma_k)$,
that is, each cluster can be represented by a multivariate normal distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$.

Model-based clustering (Banfield and Raftery 1993)
Eigenvalue decomposition of the cluster covariance matrix:
$\Sigma_k = \lambda_k D_k A_k D_k^T$
where $\lambda_k$ controls the volume, $D_k$ the orientation and $A_k$ the shape of cluster k.

Model-based clustering: common parametrizations
• EII: equal volume, spherical
• VII: unequal volume, spherical
• EEE: equal volume, shape and orientation
• VVV: unconstrained

Estimation
Likelihood of the mixture model:
$L(\theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k \, \phi(y_i \mid \mu_k, \Sigma_k)$
Given the number of clusters and the covariance structure, the EM algorithm can be used to maximize it.
The Mclust R package is available from CRAN.

Model selection
Which model is appropriate?
- Which covariance structure?
- How many clusters?
Compare the different models using BIC.

Model selection
We wish to compare two models $M_1$ and $M_2$ with parameters $\theta_1$ and $\theta_2$ respectively.
Given the observed data D, define the integrated likelihood
$p(D \mid M_k) = \int p(D \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k$,
the probability of observing the data given model $M_k$.
NB: $\theta_1$ and $\theta_2$ might have different dimensions.

Model selection
To compare two models $M_1$ and $M_2$, use the integrated likelihoods.
The integral is difficult to compute! The Bayesian information criterion approximates it:
$\mathrm{BIC}_k = 2 \log p(D \mid \hat{\theta}_k, M_k) - \nu_k \log n$,
where $\hat{\theta}_k$ is the maximum likelihood estimate and $\nu_k$ is the number of parameters in model $M_k$.

Model selection
Bayesian information criterion = measure of fit - penalty term.
A large BIC score indicates strong evidence for the corresponding model.
BIC can be used to choose the number of clusters and the covariance parametrization (Mclust).

Example revisited
library(mclust)
cho.mclust.bic <- EMclust(cho.data.std, modelNames = c("EII", "EEI"))
plot(cho.mclust.bic)
cho.mclust <- EMclust(cho.data.std, 4, "EII")
sum.cho <- summary(cho.mclust, cho.data.std)

[Figure: BIC curves for the EII and EEI models as a function of the number of clusters]
(EMclust comes from an early version of the mclust package; a sketch using the current interface is given after the summary below.)

Example revisited
par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

[Figure: expression profiles of the 4 clusters, EII model]

Example revisited
cho.mclust <- EMclust(cho.data.std, 3, "EEI")
sum.cho <- summary(cho.mclust, cho.data.std)
par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")

[Figure: expression profiles of the 3 clusters, EEI model]

Summary
• Model-based clustering is a nice alternative to heuristic clustering algorithms.
• BIC can be used to choose both the covariance structure and the number of clusters.
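The slides use the EMclust interface from early versions of the mclust package, which has since been replaced by Mclust. As a hedged sketch, not part of the original material, the same analysis might look as follows with current mclust releases; it assumes cho.data.std is the standardized version of cho.data used in the slides, while Mclust, its G and modelNames arguments, plot(..., what = "BIC") and the $G and $classification components are part of the current mclust API.

library(mclust)

# Fit Gaussian mixtures with 1-9 components under the EII and EEI
# covariance structures and compare them by BIC (larger is better)
fit <- Mclust(cho.data.std, G = 1:9, modelNames = c("EII", "EEI"))
plot(fit, what = "BIC")
summary(fit)

# Expression profiles of each estimated cluster, one panel per cluster
G <- fit$G
par(mfrow = c(2, ceiling(G / 2)))
for (k in 1:G)
  matplot(t(cho.data[fit$classification == k, , drop = FALSE]), type = "l",
          xlab = "time", ylab = "log expression value")

Mclust selects the number of clusters and the covariance structure that maximize BIC over the requested range, which is exactly the model selection strategy described above.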
Conclusion
• We have seen a few clustering algorithms; there are many others:
  • Two-way clustering
  • Plaid model
  • ...
• Clustering is a useful tool and ... a dangerous weapon
• To be consumed with moderation!