Statistical analysis of array data:
Dimensionality reduction,
Clustering
Katja Astikainen, Riikka Kaven
25.2.2005
Contents
• Problems and approaches
• Dimensionality reduction by PCA
• Clustering overview
• Hierarchical clustering
• K-means
• Mixture models and EM
Problems and approaches
• Basic idea is to find patterns of expression across multiple genes and experiments
– Models of expression can be utilized, e.g., for more precise classification of diseases (disease classification, degree of disease severity)
– Expression patterns can be utilized for exploring cellular pathways
– With the help of gene expression modelling, and also condition (experiment) clustering, one can find genes that are co-regulated
– Clustering methods can also be used for sequence alignments
• There are several methods for this, but we are going to introduce:
– Principal Component Analysis (PCA)
– Clustering (hierarchical, K-means, EM)
Dimensionality reduction by PCA
PCA is a statistical data analysis technique
– method to reduce dimensionality
– method to identify new meaningful underlying
variables
– method to compress the data
– method to visualize the data
Dimensionality reduction by PCA
• We have N data points x_1,…,x_N in an M-dimensional space, where each x is a gene expression vector.
• With PCA we can reduce the dimension to K, which is usually much lower than M.
• Imagine taking a three-dimensional cloud of data points and rotating it so you can view it from different perspectives. Certain views would allow you to separate the data into groups better than others.
• With PCA we can ignore some of the redundant experiments (low variance), or use some average of the information, without significant loss of information.
Dimensionality reduction by PCA
• We are looking for a unit vector u_1 such that, on average, the squared length of the projection of the x's along u_1 is maximal (vectors are column vectors):

  $u_1 = \arg\max_{\|u\|=1} E\{(u^T x)^2\}$

• Generally, if the first components u_1,…,u_{k-1} have been determined, the next component is the one that maximizes the residual variance:

  $u_k = \arg\max_{\|u\|=1} E\Big\{\Big[u^T\Big(x - \sum_{i=1}^{k-1} (u_i^T x)\, u_i\Big)\Big]^2\Big\}$

• The principal components of an expression vector x are given by $c_i = u_i^T x$.
Dimensionality reduction by PCA
• How can we find the eigenvectors u_i?
  – Find the eigenvectors which show the most informative part of the data: vectors that point in the directions of maximal variance of the data.
• First we calculate the covariance matrix

  $C = E\{x x^T\}$

• Find the eigenvalues $\lambda_i$ and eigenvectors $u_k$ of the covariance matrix:

  $C u_k = \lambda_k u_k$

• The eigenvalue is a measure of the proportion of the variance explained by the corresponding eigenvector.
• Select the u_i's which are the eigenvectors of the sample covariance matrix associated with the K largest eigenvalues
  – the eigenvectors which explain most of the variance in the data
  – this discovers the important features and patterns in the data
  – for data visualization, use two- or three-dimensional spaces
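A minimal numpy sketch of this procedure, assuming a data matrix X with the N expression vectors as rows (the function name and example shapes are illustrative, not from the slides):

```python
import numpy as np

def pca(X, K):
    """Project the rows of X onto the K eigenvectors of the sample
    covariance matrix with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)               # center each variable
    C = np.cov(Xc, rowvar=False)          # M x M sample covariance matrix, C = E{x x^T}
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in decreasing order
    U = eigvecs[:, order[:K]]             # M x K matrix of the top-K eigenvectors u_i
    components = Xc @ U                   # principal components c_i = u_i^T x for each row
    explained = eigvals[order[:K]] / eigvals.sum()  # proportion of variance explained
    return components, U, explained

# Example: 100 expression vectors in a 20-dimensional experiment space,
# reduced to K = 2 dimensions for visualization.
X = np.random.default_rng(0).normal(size=(100, 20))
Z, U, explained = pca(X, K=2)
print(Z.shape, explained)
```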
Clustering overview
• Data analysis methods for discovering patterns and
underlying cluster structures
• Different kinds of methods, such as hierarchical clustering, partitioning-based k-means and the Self-Organizing Map (SOM)
• There is no single method that is best for every data set
• Clustering methods (like k-means) are unsupervised methods
  – there is no information about the true clusters or their number
  – clustering algorithms are used for analysing the data
  – discovered clusters are just estimates of the truth (often the result is a local optimum)
Clustering overview
• Data types
– Typically the clustered data is numerical vector data like
gene expression data (expression vectors)
– Numerical data can also be represented in relative
coordinates
– Data might also be qualitative (nominal), which makes comparing the data elements challenging
• The number of clusters is often unknown
• One way to estimate the number of clusters is to analyse the data by PCA
  – you might use the eigenvectors to estimate the number of clusters
• Another way is to make guesses and justify the number of clusters by good results (whatever they are)
Clustering overview
• Similarity measures
  – Pearson correlation (normalized vectors' dot product):

    $r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$

• Distance measures
  – Euclidean (natural distance between two vectors):

    $d = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$

• It is important to use appropriate distance/similarity measures
  – in Euclidean space, vectors might be close to each other but their correlation could be 0, e.g. x = 1000000000 and y = 0000000001
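A small Python illustration of this point, using the two example vectors above (numpy assumed):

```python
import numpy as np

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], dtype=float)

# Euclidean distance: the vectors look close to each other...
d = np.sqrt(np.sum((x - y) ** 2))   # sqrt(2) ~ 1.41

# ...but their Pearson correlation is low (here slightly negative, close to 0)
r = np.corrcoef(x, y)[0, 1]         # -1/9 ~ -0.11

print(d, r)
```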
Clustering overview
Cost function and probabilistic interpretation:
• For comparing different ways of clustering the same
data, we need some kind of cost function for the
clustering algorithm
• The goal of clustering is to try to minimize such a cost function
• Generally the cost function depends on some quantities:
  – Centers of the clusters
  – The distance of each point in a cluster to the cluster center
  – The average degree of similarity of the points in a cluster
• Cost functions are algorithm-specific, so comparing the results of different clustering algorithms might be almost impossible
Clustering overview
Cost function and probabilistic interpretation:
• There are some advantages associated with probabilistic models, so they are often utilized in cost functions
• A popular choice is to use the negative log-likelihood of an underlying probabilistic model as the clustering cost function
Hierarchical clustering
• The basic idea is to construct a hierarchical tree which consists of nested clusters
• The algorithm is a bottom-up method where clustering starts from single data points (genes) and stops when all data points are in the same cluster (the root of the tree)
• Clustering begins by computing pairwise similarities between each data point; once clusters are formed, similarities are computed between clusters.
• The branching process is repeated at most N-1 times, which means that the leaf nodes (genes) form the first pairs and the tree becomes a binary tree.
Hierarchical clustering: phases
• Calculate the pairwise similarities between data points into a matrix.
• Find the two data points (nodes in the tree) which are closest to each other or most similar.
• Group them together to make a new cluster.
• Calculate the average vector of the data points, which is the expression profile for the cluster (the inner node in the tree that joins the leaf nodes = data point vectors).
• Calculate a new correlation matrix
  – calculate the pairwise similarity between the new cluster and the other clusters.
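A compact Python sketch of these phases, using Pearson correlation as the similarity measure and the average expression profile as each new cluster's representative (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def hierarchical_cluster(X):
    """Bottom-up clustering of the rows of X (expression vectors).
    Returns the merge history as (cluster_a, cluster_b, new_cluster_id)."""
    # Each active cluster is represented by its average expression profile.
    clusters = {i: X[i].astype(float) for i in range(len(X))}
    members = {i: [i] for i in range(len(X))}
    merges, next_id = [], len(X)

    def similarity(a, b):
        # Pearson correlation between two profile vectors
        return np.corrcoef(a, b)[0, 1]

    while len(clusters) > 1:
        # Find the pair of clusters with the highest similarity
        ids = list(clusters)
        a, b = max(((p, q) for i, p in enumerate(ids) for q in ids[i + 1:]),
                   key=lambda pair: similarity(clusters[pair[0]], clusters[pair[1]]))
        # New inner node: the average vector of all member expression vectors
        members[next_id] = members.pop(a) + members.pop(b)
        clusters[next_id] = X[members[next_id]].mean(axis=0)
        del clusters[a], clusters[b]
        merges.append((a, b, next_id))
        next_id += 1
    return merges

X = np.random.default_rng(1).normal(size=(6, 8))   # 6 genes, 8 experiments
print(hierarchical_cluster(X))
```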
Tree Visualization
• With hierarchical clustering we can find the clusters of data points, but the constructed tree is not yet in an optimal order
• After finding the dendrogram, which tells the similarity between nodes and genes, the final, optimal linear order for the nodes can be discovered with the help of dynamic programming
Tree visualization with dynamic
programming [2]
[Figure: expression data matrix with genes A–E as rows and experiments as columns]
Goal: Quickly and easily arrange the data for further
inspection
Tree visualization with dynamic
programming [2]
[Figure: hierarchical tree over genes A–E with leaves in the order A, B, C, D, E]
Greedily join nearest cluster pair [3]
nearest: we use correlation coefficient (normalized dot product)
can use other measures as well
Tree visualization with dynamic
programming [2]
[Figure: hierarchical tree over genes A–E with leaves in the order A, C, B, D, E]
• Greedily join nearest cluster pair [3]
• Optimal ordering: minimize summed distance between consecutive genes
  – Criterion suggested by Eisen
Tree visualization with dynamic
programming [2]
[Figure: hierarchical tree over genes A–E with leaves in the order B, A, C, E, D]
• Greedily join nearest cluster pair [3]
• Optimal ordering: minimize summed distance between consecutive genes
  – Criterion suggested by Eisen
Hierarchical clustering: dynamic programming
• The optimal linear ordering for the gene expression vectors can be computed in O(N^4) steps
• We would like to maximize the similarity between neighbouring nodes:

  $\sum_{i=1}^{N-1} C\big(G_{\pi(i)}, G_{\pi(i+1)}\big)$

  where $G_{\pi(i)}$ is the i-th leaf when the tree is ordered according to $\pi$.
• The algorithm works from the bottom up towards the root by recursively computing the cost of the optimal ordering M(V,U,W)
[1]
Hierarchical clustering: dynamic programming
• The dynamic programming recurrence is given by:

  $M(V,U,W) = \max_{R \in V_l,\, S \in V_r} \big[ M(V_l, U, R) + M(V_r, S, W) + C(R, S) \big]$

• The optimal cost M(V) for V is obtained by maximizing over all pairs U, W.
• The globally optimal cost is obtained recursively when V is the root of the tree, and the optimal ordering can be found by standard backtracking.
[1]
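A small Python sketch of this recurrence on a toy tree with a made-up similarity matrix C (the tree, the values and all names are illustrative; a practical implementation would also backtrack to recover the ordering and cache subtree leaf sets to reach the stated complexity):

```python
import numpy as np
from functools import lru_cache

# Toy similarity matrix C between four genes (illustrative values only)
C = np.array([[1.0, 0.8, 0.1, 0.2],
              [0.8, 1.0, 0.3, 0.1],
              [0.1, 0.3, 1.0, 0.9],
              [0.2, 0.1, 0.9, 1.0]])

# A binary clustering tree over genes 0..3: leaves are ints, inner nodes are pairs
tree = ((0, 2), (1, 3))

def leaves(v):
    return (v,) if isinstance(v, int) else leaves(v[0]) + leaves(v[1])

@lru_cache(maxsize=None)
def M(v, u, w):
    """Best summed similarity of an ordering of the leaves of subtree v
    that has leaf u leftmost and leaf w rightmost."""
    if isinstance(v, int):
        # a single leaf can only start and end with itself
        return 0.0 if u == v == w else float("-inf")
    vl, vr = v
    if u in leaves(vr):                  # flip children so that u lies in the left subtree
        vl, vr = vr, vl
    return max(M(vl, u, r) + M(vr, s, w) + C[r, s]
               for r in leaves(vl) for s in leaves(vr))

# Optimal cost: maximize over all pairs (u, w) taken from different root subtrees
root_l, root_r = tree
best_cost, u, w = max((M(tree, u, w), u, w)
                      for u in leaves(root_l) for w in leaves(root_r))
print(best_cost, u, w)
```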
k-means algorithm
• Data points are divided into k clusters
• Find, by iterating, a set of centroids C = {v_1,…,v_K} which minimizes the squared distances (d^2) between the expression vectors x_1,…,x_n and the centroid to which each belongs, REP[x_j, C]:

  $L(C) = \sum_{j=1}^{n} d^2\big(x_j, REP[x_j, C]\big)$

  where the distance measure d is Euclidean.
• In practice the result is an approximation (a local optimum).
• Each expression vector belongs to exactly one cluster.
k-means algorithm: phases
1. Initially put the expression vectors randomly into k clusters.
2. Define the cluster centroids by calculating the average vector of the expression vectors that belong to each cluster.
3. Compute the distances between the expression vectors and the centroids.
4. Move every expression vector into the cluster with the closest centroid.
5. Define new centroids for the clusters. If the cluster centroids are stable or some other stopping criterion is met, stop the algorithm. Otherwise repeat steps 3-5.
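A short Python sketch of steps 1–5 above (the function name, array shapes and the stopping test are illustrative):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Random initial partition, then alternate between assigning vectors
    to the closest centroid and recomputing the centroids."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))             # 1. random initial partition
    for _ in range(max_iter):
        # 2./5. centroids = average vector of each cluster (re-seed empty clusters)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else X[rng.integers(len(X))] for j in range(k)])
        # 3. squared Euclidean distances to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # 4. move each vector to the cluster with the closest centroid
        new_labels = d2.argmin(axis=1)
        if np.array_equal(new_labels, labels):         # assignments stable -> stop
            break
        labels = new_labels
    return labels, centroids

X = np.random.default_rng(1).normal(size=(30, 5))     # 30 expression vectors, 5 experiments
labels, centroids = kmeans(X, k=3)
print(labels)
```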
k-means clustering
Figure 4 [4]: K-means example: 1) Expression vectors are randomly divided into three clusters. 2) Define the centroids. 3) Compute the expression vectors' distances to the centroids. 4) Compute the centroids' new locations. 5) Compute the expression vectors' distances to the centroids. 6) Compute the centroids' new locations and finish the clustering because the centroids have stabilized. The clusters formed are circled.
Mixture models and EM
• The EM algorithm is based on modelling complex distributions by combining simple Gaussian distributions of the clusters
• The k-means algorithm is an online approximation of the EM algorithm
  – it maximizes the quadratic log-likelihood (minimizes the quadratic distances of the data points to their cluster centroids)
• The EM algorithm is used to optimize the centers of each cluster (maximizing the membership-weighted likelihood), which means that we find the maximum likelihood estimate for the center of the Gaussian distribution of each cluster
• Some initial guesses have to be made before starting:
  – the number of clusters (k)
  – the initial centers of the clusters
Mixture models and EM
The algorithm is an iterative process with two optimization tasks:
• E-step: the membership probabilities (hidden variables) of each data point for each mixture component (cluster) are estimated:

  $P(M_k \mid d_i) = \frac{P(d_i \mid M_k)\, P(M_k)}{P(d_i)}$

  The maximum likelihood estimate of the mixing coefficient is the sample mean of the conditional probabilities that $d_i$ comes from model k:

  $\lambda_k^* = \frac{1}{N} \sum_{i=1}^{N} P(M_k \mid d_i)$
Mixture models and EM
• M-step: K separate estimation problems of maximizing the log-likelihood of component k, with each data point weighted by its estimated membership probability:

  $\sum_{i=1}^{N} P(M_k \mid d_i)\, \frac{\partial \log P(d_i \mid M_k)}{\partial w_{kj}} = 0$

• In the M-step the means of the Gaussian distributions are estimated so that they maximize the likelihood of the models.
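A minimal Python sketch of these E- and M-steps, assuming a mixture of spherical, unit-variance Gaussians for simplicity (the function name, the data and the variance assumption are ours, not from the slides):

```python
import numpy as np

def em_gaussian_mixture(X, k, n_iter=50, seed=0):
    """EM for a mixture of k spherical Gaussians with unit variance.
    Returns mixing coefficients, means and membership probabilities."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    means = X[rng.choice(n, size=k, replace=False)]   # initial guesses for the centers
    mix = np.full(k, 1.0 / k)                         # initial mixing coefficients

    for _ in range(n_iter):
        # E-step: membership probabilities P(M_k | d_i) via Bayes' rule
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        lik = np.exp(-0.5 * d2)                       # Gaussian density up to a constant
        post = mix * lik
        post /= post.sum(axis=1, keepdims=True)       # normalize over the k models

        # M-step: update mixing coefficients and means
        mix = post.mean(axis=0)                       # sample mean of the memberships
        means = (post.T @ X) / post.sum(axis=0)[:, None]  # weighted means maximize likelihood
    return mix, means, post

# Example: three well-separated groups of 2-dimensional points
X = np.vstack([np.random.default_rng(2).normal(loc=c, size=(20, 2)) for c in (-3, 0, 3)])
mix, means, post = em_gaussian_mixture(X, k=3)
print(mix, means)
```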
References
[1] Baldi, P. and Hatfield, G. W., DNA Microarrays and Gene Expression, Cambridge University Press, 2002, 73-96.
[2] URL http://www-2.cs.cmu.edu/~zivbj/class04/lecture11.ppt
[3] Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998). Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A 95, 14863-8.
[4] Gasch, A. P. and Eisen, M. B., Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology, 3, 11 (2002), 1-22. URL http://citeseer.ist.psu.edu/gasch02exploring.html.