Canadian Bioinformatics Workshops
www.bioinformatics.ca
Essential Statistics in Biology: Getting the Numbers Right
Raphael Gottardo
Clinical Research Institute of Montreal (IRCM)
[email protected]
http://www.rglab.org
Outline
• Exploratory Data Analysis
• 1-2 sample t-tests, multiple testing
• Clustering
• SVD/PCA
• Frequentists vs. Bayesians
Clustering
(Multivariate analysis)
Outline
• Basics of clustering
• Hierarchical clustering
• K-means
• Model based clustering
What is it?
Clustering is the classification of similar objects into groups.
Partition a data set into subsets (clusters) so that the data in each subset are “close” to one another, where proximity is usually defined by some distance measure.
Examples: clustering web (WWW) documents, clustering genes by expression.
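As a small illustration of proximity under a distance measure (a sketch on a made-up matrix, not data from the slides), R's dist() computes pairwise distances between the rows of a matrix; for expression profiles, 1 - correlation is another common choice.
# Toy expression matrix: 4 "genes" (rows) by 3 conditions (columns); values are made up
x <- rbind(g1 = c(1, 2, 3),
           g2 = c(1.1, 2.1, 3.1),
           g3 = c(3, 2, 1),
           g4 = c(0, 0, 5))
dist(x, method = "euclidean")   # Euclidean distances between genes
dist(x, method = "manhattan")   # Manhattan (city-block) distances
as.dist(1 - cor(t(x)))          # 1 - Pearson correlation between gene profiles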
Hierarchical clustering
Given N items and a distance metric:
1. Assign each item to its own cluster.
   Initialize the distance matrix between clusters as the distance between items.
2. Find the closest pair of clusters and merge them into a single cluster.
3. Compute the new distances between clusters.
4. Repeat steps 2-3 until all items are merged into a single cluster.
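A minimal R sketch of these four steps on made-up data (object names are illustrative): hclust() carries out the agglomeration, and its merge and height components record which clusters were joined at each step and at what distance.
set.seed(1)
x <- matrix(rnorm(10), ncol = 2)      # 5 items in 2 dimensions
d <- dist(x)                          # step 1: distances between the items
hc <- hclust(d, method = "single")    # steps 2-4: repeatedly merge the closest pair of clusters
hc$merge                              # which items/clusters were merged at each of the 4 steps
hc$height                             # the distance at which each merge occurred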
Single linkage
The distance between clusters is defined as the
shortest distance from any member of one
cluster to any member of the other cluster.
[Figure: two clusters; d is the shortest distance between a member of Cluster 1 and a member of Cluster 2]
Complete linkage
The distance between clusters is defined as the
greatest distance from any member of one
cluster to any member of the other cluster.
[Figure: two clusters; d is the greatest distance between a member of Cluster 1 and a member of Cluster 2]
Average linkage
The distance between clusters is defined as the
average distance from any member of one
cluster to any member of the other cluster.
[Figure: two clusters; d is the average of all pairwise distances between members of Cluster 1 and Cluster 2]
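To see how the linkage choice affects the result, a small sketch on made-up data (not from the slides) clusters the same points with the three linkages; single linkage tends to chain nearby points together, while complete and average linkage favour compact clusters.
set.seed(2)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),   # 20 points near (0, 0)
           matrix(rnorm(40, mean = 2, sd = 0.3), ncol = 2))   # 20 points near (2, 2)
d <- dist(x)
cut.single   <- cutree(hclust(d, method = "single"),   k = 2)
cut.complete <- cutree(hclust(d, method = "complete"), k = 2)
cut.average  <- cutree(hclust(d, method = "average"),  k = 2)
table(cut.single, cut.complete)       # compare the two 2-cluster solutions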
Example
Cell cycle dataset (Cho et al. 1998)
Expression levels of ~6000 genes during the cell cycle
17 time points (2 cell cycles)
Example
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])  # first 50 genes, 17 time points
D.cho <- dist(cho.data, method = "euclidean")                                      # Euclidean distance matrix
hc.single <- hclust(D.cho, method = "single", members = NULL)                      # single-linkage clustering
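The cuts illustrated below (k = 2, 3, 4, 5, 25) can be reproduced from the hc.single object above with plot() and cutree() (a sketch, assuming the data file is available as on the slide):
plot(hc.single)                 # dendrogram of the 50 genes
rect.hclust(hc.single, k = 4)   # outline a 4-cluster cut on the dendrogram
cl4 <- cutree(hc.single, k = 4) # cluster label of each gene for k = 4
table(cl4)                      # cluster sizes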
Example
Single linkage
[Figures: single-linkage dendrogram of the 50 genes, with cuts at k = 2, 3, 4, 5, and 25 clusters]
Example
Single linkage, k = 4
[Figure: expression profiles over time for the four single-linkage clusters, panels 1-4]
Example
Complete linkage, k = 4
[Figures: complete-linkage dendrogram cut at k = 4, and expression profiles over time for the four clusters, panels 1-4]
K-means
N items, assume K clusters.
The goal is to minimize
\sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2
over the possible assignments C_k and centroids \mu_k.
\mu_k represents the location of cluster k.
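As a quick check of this objective (a sketch on made-up data), the within-cluster sum of squares can be recomputed by hand from the assignments and centroids and compared with the tot.withinss value that kmeans() reports:
set.seed(1)
x <- matrix(rnorm(60), ncol = 2)               # 30 items in 2 dimensions
cl <- kmeans(x, centers = 3)
wss <- sum(sapply(1:3, function(k) {
  xk <- x[cl$cluster == k, , drop = FALSE]     # items assigned to cluster k
  sum(sweep(xk, 2, cl$centers[k, ])^2)         # squared distances to the centroid of cluster k
}))
c(by_hand = wss, kmeans = cl$tot.withinss)     # the two values agree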
K-means - algorithm
1. Divide the data into K initial clusters.
   Initialize the centroids as the means of these clusters.
2. Assign each item to the cluster with the closest centroid.
3. When all items have been assigned, recalculate the centroids (cluster means).
4. Repeat steps 2-3 until the centroids no longer move.
K-means - algorithm
set.seed(100)
# Two groups of 50 points in 2-D: one centred at (0, 0), one at (1, 1)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
set.seed(100)
for (i in 1:4) {
  set.seed(100)
  # K-means with 5 random initial centres, stopped after i iterations
  cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = i)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
  Sys.sleep(2)    # pause so each iteration's assignment can be seen
}
Example
[Figure: K-means cluster assignments and centroids after iterations 1-4]
Why?
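One plausible reading of the "Why?" (not stated on the slide) is that K-means only converges to a local minimum of the objective, so the result depends on the random initial centroids. A common remedy is to run several random starts and keep the best, e.g. with the nstart argument (reusing the x from the code above):
set.seed(100)
cl.one  <- kmeans(x, centers = 5)                # a single random start
cl.best <- kmeans(x, centers = 5, nstart = 25)   # keep the best of 25 random starts
c(one_start = cl.one$tot.withinss, best_of_25 = cl.best$tot.withinss)   # multi-start is typically at least as good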
Example
[Figure: K-means cluster assignments, panels 1-4]
Summary
• K-means and hierarchical clustering methods are useful techniques
• Fast and easy to implement
  Beware of the memory requirements of hierarchical clustering
• A bit “ad hoc”:
  • Number of clusters?
  • Distance metric?
  • Good clustering?
Model based clustering
• Based on probability models (e.g. Normal mixture models)
• Lets us define what a “good” clustering is
• Compare several models
• Estimate the number of clusters!
Model based clustering
Yeung et al. (2001)
Multivariate observations x_1, ..., x_n, K clusters.
Assume observation i belongs to cluster k; then
x_i \sim N(\mu_k, \Sigma_k),
that is, each cluster can be represented by a multivariate normal distribution with mean \mu_k and covariance \Sigma_k.
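A minimal sketch of this generative assumption with made-up parameters (the MASS package provides mvrnorm()): each cluster contributes observations drawn from its own multivariate normal distribution.
library(MASS)                                            # for mvrnorm()
mu1 <- c(0, 0);  Sigma1 <- 0.2 * diag(2)                 # cluster 1: mean and covariance
mu2 <- c(3, 3);  Sigma2 <- matrix(c(1, 0.5, 0.5, 1), 2)  # cluster 2: correlated covariance
x <- rbind(mvrnorm(50, mu1, Sigma1),
           mvrnorm(50, mu2, Sigma2))
z <- rep(1:2, each = 50)                                 # true cluster labels
plot(x, col = z)                                         # the two multivariate normal clusters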
Model based clustering
Banfield and Raftery (1993)
Eigenvalue decomposition of the cluster covariance matrices:
\Sigma_k = \lambda_k D_k A_k D_k^T
where \lambda_k controls the volume, D_k the orientation, and A_k the shape of cluster k.
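A small numeric sketch of this decomposition on a made-up covariance matrix: the eigenvectors give the orientation D, the eigenvalues normalized to determinant one give the shape A, and lambda = |Sigma|^(1/d) gives the volume.
Sigma <- matrix(c(2, 1, 1, 2), 2)    # an example covariance matrix
e <- eigen(Sigma)
D <- e$vectors                       # orientation: the principal directions
lambda <- prod(e$values)^(1/2)       # volume: |Sigma|^(1/d) with d = 2
A <- diag(e$values / lambda)         # shape: normalized eigenvalues, det(A) = 1
D %*% (lambda * A) %*% t(D)          # recovers Sigma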
Model based clustering
Constraining \lambda_k, D_k, and A_k gives a family of models:
• Equal volume, spherical (EII): \Sigma_k = \lambda I
• Unequal volume, spherical (VII): \Sigma_k = \lambda_k I
• Equal volume, shape, and orientation (EEE): \Sigma_k = \lambda D A D^T
• Unconstrained (VVV): \Sigma_k = \lambda_k D_k A_k D_k^T
Estimation
Likelihood (mixture model):
L(\theta) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k \, \phi(x_i \mid \mu_k, \Sigma_k),
where \pi_k are the mixing proportions and \phi is the multivariate normal density.
Given the number of clusters and the covariance structure, the EM algorithm can be used to estimate the parameters.
The Mclust R package is available from CRAN.
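A sketch of this likelihood evaluated numerically, reusing the simulated x, mu1, Sigma1, mu2, Sigma2 from the sketch above and the mvtnorm package for the multivariate normal density (illustrative only, not from the slides):
library(mvtnorm)                              # for dmvnorm()
pi.k <- c(0.5, 0.5)                           # mixing proportions
f1 <- dmvnorm(x, mean = mu1, sigma = Sigma1)  # component 1 density at each observation
f2 <- dmvnorm(x, mean = mu2, sigma = Sigma2)  # component 2 density at each observation
sum(log(pi.k[1] * f1 + pi.k[2] * f2))         # mixture log-likelihood at these parameters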
Model selection
Which model is appropriate?
- Which covariance structure?
- How many clusters?
Compare the different models using BIC
Model selection
We wish to compare two models M_1 and M_2 with parameters \theta_1 and \theta_2 respectively.
Given the observed data D, define the integrated likelihood
p(D \mid M_k) = \int p(D \mid \theta_k, M_k) \, p(\theta_k \mid M_k) \, d\theta_k,
the probability of observing the data given model M_k.
NB: \theta_1 and \theta_2 might have different dimensions.
Model selection
To compare two models M_1 and M_2, use the integrated likelihoods.
The integral is difficult to compute!
Bayesian information criterion:
BIC_k = 2 \log p(D \mid \hat{\theta}_k, M_k) - \nu_k \log n,
where p(D \mid \hat{\theta}_k, M_k) is the maximum likelihood, \nu_k is the number of parameters in model M_k, and n is the number of observations.
Model selection
Bayesian information criterion:
BIC_k = 2 \log p(D \mid \hat{\theta}_k, M_k) - \nu_k \log n
      = (measure of fit) - (penalty term)
A large BIC score indicates strong evidence for the corresponding model.
BIC can be used to choose the number of clusters and the covariance parametrization (Mclust).
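A sketch of this comparison with the current mclust interface (mclustBIC(); the slides use the older EMclust() function, so the call below is an assumption about the modern API rather than the workshop code), applied to the simulated two-cluster data from earlier:
library(mclust)
bic <- mclustBIC(x, G = 1:5, modelNames = c("EII", "VII", "VVV"))   # BIC for each model and number of clusters
summary(bic)                                                        # the top models ranked by BIC
plot(bic)                                                           # BIC curves; the maximum picks the model and K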
Example revisited
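The code below uses cho.data.std, which is never defined on the slides; one plausible definition (an assumption, not from the slides) is per-gene standardization of the expression matrix:
cho.data.std <- t(scale(t(cho.data)))   # assumed: scale each gene (row) to mean 0 and standard deviation 1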
library(mclust)
cho.mclust.bic <- EMclust(cho.data.std, modelNames = c("EII", "EEI"))  # BIC for the EII and EEI models
plot(cho.mclust.bic)                                                   # BIC vs. number of clusters
cho.mclust <- EMclust(cho.data.std, 4, "EII")                          # fit the EII model with 4 clusters
sum.cho <- summary(cho.mclust, cho.data.std)                           # classification of each gene
[Figure: BIC as a function of the number of clusters for models 1 = EII and 2 = EEI]
Example revisited
par(mfrow = c(2, 2))   # 2 x 2 grid, one panel per cluster
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 4, ]), type = "l", xlab = "time", ylab = "log expression value")
Example revisited
[Figure: expression profiles over time for the four clusters of the EII model, panels 1-4]
Example revisited
cho.mclust <- EMclust(cho.data.std, 3, "EEI")   # fit the EEI model with 3 clusters
sum.cho <- summary(cho.mclust, cho.data.std)
par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
Example revisited
[Figure: expression profiles over time for the three clusters of the EEI model, panels 1-3]
Summary
• Model based clustering is a nice
alternative to heuristic clustering
algorithms
• BIC can be used for choosing the
covariance structure and the number of
clusters
Conclusion
• We have seen a few clustering algorithms
• There are many others
• Two way clustering
• Plaid model ...
• Clustering is a useful tool and ... a
dangerous weapon
• To be consumed with moderation!