Sequential K Means Algorithms

General problem
o Unsupervised learning
o Classification = clustering
o Static distribution (in contrast with HMM)
o Large amounts of data
o Problems with strangely shaped clusters
Applications
o Clustering gene expression data
 Cluster analysis and display of genome-wide expression patterns. M. B. Eisen et al. (1998) PNAS 95, 14863-14868
 Distinct types of diffuse large B-cell lymphoma identified by gene-expression profiling. A. Alizadeh et al. (2000) Nature 403, 503-511
 Revealing modular organization in the yeast transcriptional network. J. Ihmels et al. (2002) Nature Genet. 31, 370-377
 Cluster analysis of gene expression data. E. Domany (2003) J. Stat. Phys. 110, 1117-1139
o Clustering genes into families (clustering graphs)
 An efficient algorithm for large-scale detection of protein families. A. Enright et al. (2002) Nucl. Acids Res. 30, 1575-1584
 Curvature of co-links uncovers hidden thematic layers in the World Wide Web. J.-P. Eckmann and E. Moses (2002) PNAS 99, 5825-5829
 A genomic perspective on protein families. R. L. Tatusov et al. (1997) Science 278, 631-637
o Clustering transcription factor binding sites
 Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics. E. van Nimwegen et al. (2002) PNAS 99, 7323-7328
Typical Distance metrics
o Genetic distance (mutation distance)
o Amino-acid distance
o Euclidean distance
o Manhattan distance
o Bit match
Distance Measures: Minkowski Metric
Suppose two objects x and y both have p features:
x  ( x1 x 2
xp )
y  ( y1 y 2
yp )
The Minkowski metric is defined by
d ( x, y ) 
p
r
| x
i
i 1
 yi |r
1, r  2 (Euclidean distance )
d ( x, y ) 
p
2
| x
i
 yi |2
i 1
2, r  1 (Manhattan distance)
d ( x, y ) 
p
| x
i
 yi |
i 1
3, r   ("sup"
distance
)
d ( x, y )  max | xi  yi |
1 i  p
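As a concrete illustration, here is a minimal plain-Python sketch of these three special cases; the function names are our own and the vectors are made up for the example.

    # Minkowski distance sketch (illustrative; function names are our own).
    def minkowski(x, y, r):
        """d(x, y) = (sum_i |x_i - y_i|^r)^(1/r) for two equal-length vectors."""
        return sum(abs(xi - yi) ** r for xi, yi in zip(x, y)) ** (1.0 / r)

    def sup_distance(x, y):
        """Limit r -> infinity: d(x, y) = max_i |x_i - y_i|."""
        return max(abs(xi - yi) for xi, yi in zip(x, y))

    x, y = (1.0, 2.0, 3.0), (4.0, 0.0, 3.0)
    print(minkowski(x, y, 1))   # Manhattan distance: 5.0
    print(minkowski(x, y, 2))   # Euclidean distance: sqrt(13) ~ 3.61
    print(sup_distance(x, y))   # sup distance: 3.0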
 Non-hierarchical methods divide a dataset of N objects into M clusters, with or without overlap.
 Hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains.
 Non-hierarchical methods
o Partitioning methods - classes are mutually exclusive.
o Clumping methods - overlap is allowed.
 Hierarchical methods
o Divisive methods begin with all objects in a single cluster and at each of N-1 steps divide some cluster into two smaller clusters, until each object resides in its own cluster.
o Agglomerative methods - the hierarchy is built up in a series of N-1 agglomerations, or fusions, of pairs of objects or clusters, beginning with the un-clustered dataset (a single-linkage sketch follows this list).
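To make the agglomerative idea concrete, the following sketch (plain Python, single-linkage, names our own; not code from these notes) starts with one cluster per object and performs N-1 fusions of the closest pair. Any of the distance functions above can be passed in as dist.

    # Single-linkage agglomerative clustering sketch (illustrative).
    def single_linkage(points, dist):
        """Start with one cluster per point and fuse the closest pair N-1 times.
        Returns the list of fusions as (cluster_a, cluster_b, distance)."""
        clusters = [[i] for i in range(len(points))]
        fusions = []
        while len(clusters) > 1:
            # Find the pair of clusters with the smallest single-linkage distance.
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = min(dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[2]:
                        best = (a, b, d)
            a, b, d = best
            fusions.append((clusters[a], clusters[b], d))
            clusters[a] = clusters[a] + clusters[b]   # fuse cluster b into cluster a
            del clusters[b]
        return fusions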
Non-hierarchical Methods
 Partitioning Methods
o Partitioning methods generally result in a set of M clusters, with each object belonging to one cluster. Each cluster may be represented by a centroid or a cluster representative.
o The centroid is usually the arithmetic mean of the attribute vectors for all objects within the cluster.
o If the number of clusters is large, the centroids can be further clustered to produce a hierarchy within the dataset.
o Partitioning methods are divided according to the number of passes over the data.
o Single pass
 Basic partitioning methods
o Multiple passes
 K-means (very widely used)

Single pass Algorithm:
o Make the first object the centroid for the first cluster.
o For the next object, calculate the similarity, S, with each existing cluster centroid, using some similarity coefficient.
o If the highest calculated S is greater than some specified threshold value, add the object to the corresponding cluster and re-determine the centroid; otherwise, use the object to initiate a new cluster. If any objects remain to be clustered, return to step 2.
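A minimal sketch of this single-pass procedure in Python follows. The notes leave the similarity coefficient and threshold open, so both are parameters here; the example usage with negative squared Euclidean distance is purely an assumption.

    # Single-pass clustering sketch (illustrative; the similarity function and
    # threshold are assumptions, not fixed by the notes).
    def single_pass(objects, similarity, threshold):
        centroids = []          # one centroid per cluster
        members = []            # members[c] = list of objects in cluster c
        for x in objects:
            if not centroids:                       # first object starts cluster 1
                centroids.append(list(x))
                members.append([x])
                continue
            # similarity of x to every existing centroid
            sims = [similarity(x, c) for c in centroids]
            best = max(range(len(sims)), key=lambda c: sims[c])
            if sims[best] > threshold:              # join the best cluster, update its centroid
                members[best].append(x)
                n = len(members[best])
                centroids[best] = [sum(col) / n for col in zip(*members[best])]
            else:                                    # otherwise start a new cluster
                centroids.append(list(x))
                members.append([x])
        return centroids, members

    # Example usage (assumed similarity: negative squared Euclidean distance):
    # centroids, members = single_pass(data,
    #     lambda a, b: -sum((ai - bi) ** 2 for ai, bi in zip(a, b)), threshold=-4.0)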

Single Pass – Advantages and Disadvantages
o As its name implies, this method requires only one pass through the dataset; the time requirements are typically of order O(N log N) for order O(log N) clusters. This makes it a very efficient clustering method.
o A disadvantage is that the resulting clusters are not independent of the order in which the objects are processed, with the first clusters formed usually being larger than those created later in the clustering run.

K-means: The Algorithm
o Given a set of numeric points in d-dimensional space, and an integer k
o The algorithm generates k (or fewer) clusters as follows:
 1. Assign all points to a cluster at random
 2. Compute the centroid for each cluster
 3. Reassign each point to the nearest centroid
 4. If any centroid changed, go back to step 2
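A compact sketch of these four steps in plain Python; it assumes Euclidean distance and re-seeds empty clusters at random, both of which are our own choices.

    import random

    # K-means sketch following the four steps above (illustrative, not optimized).
    def kmeans(points, k, max_iter=100):
        # 1. Assign all points to a cluster at random.
        assign = [random.randrange(k) for _ in points]
        for _ in range(max_iter):
            # 2. Compute the centroid (mean) of each cluster.
            centroids = []
            for c in range(k):
                cluster = [p for p, a in zip(points, assign) if a == c]
                if cluster:
                    centroids.append([sum(col) / len(cluster) for col in zip(*cluster)])
                else:
                    centroids.append(random.choice(points))   # re-seed an empty cluster
            # 3. Reassign each point to its nearest centroid (squared Euclidean distance).
            new_assign = [min(range(k),
                              key=lambda c: sum((pi - ci) ** 2
                                                for pi, ci in zip(p, centroids[c])))
                          for p in points]
            # 4. Stop when no assignment (hence no centroid) changes.
            if new_assign == assign:
                break
            assign = new_assign
        return centroids, assign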

K-means: Sample Application
o Gene clustering.
o Given a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line.
o Normalization allows comparisons across microarrays.
o Produce clusters of genes which vary in similar ways over time.
o Hypothesis: genes which vary in the same way may be co-regulated and/or participate in the same pathway.
K-means: Weaknesses
o Must choose the parameter k in advance, or try many values.
o Data must be numerical and must be compared via Euclidean distance (there is a variant called the k-medians algorithm that addresses these concerns).
o The algorithm works best on data which contains spherical clusters; clusters with other geometry may not be found.
o The algorithm is sensitive to outliers, i.e. points which do not belong in any cluster. These can distort the centroid positions and ruin the clustering.
Adaptive K-means Method
o In order to increase the stability of the cluster analysis, specific or adaptive weights in the distance formula can be applied rather than the ordinary (equal) weights.
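The notes do not spell out the weight formulas here. Purely as an illustration of the idea, the sketch below uses a weighted Euclidean distance with inverse-variance weights; that particular choice of adaptive weight is our own assumption.

    # Weighted Euclidean distance sketch (illustrative; inverse-variance
    # weights are our own choice of "adaptive" weight, not fixed by the notes).
    def weighted_euclidean(x, y, w):
        return sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)) ** 0.5

    def inverse_variance_weights(points):
        """One possible adaptive weighting: w_i = 1 / variance of attribute i."""
        n = len(points)
        weights = []
        for col in zip(*points):
            mean = sum(col) / n
            var = sum((v - mean) ** 2 for v in col) / n
            weights.append(1.0 / var if var > 0 else 0.0)
        return weights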
Fuzzy K means
o The clusters produced by the k-means procedure are sometimes called "hard" or "crisp" clusters, since any feature vector x either is or is not a member of a particular cluster. This is in contrast to "soft" or "fuzzy" clusters, in which a feature vector x can have a degree of membership in each cluster.
o The fuzzy k-means procedure allows each feature vector x to have a degree of membership in each cluster i.

Fuzzy K means Algorithm
o Make initial guesses for the means m1, m2, ..., mk
o Until there are no changes in any mean:
 Use the estimated means to find the degree of membership u(j,i) of xj in Cluster i; for example, if dist(j,i) = exp(-||xj - mi||^2), one might use u(j,i) = dist(j,i) / \sum_j dist(j,i)
 For i from 1 to k
 Replace mi with the fuzzy mean of all of the examples for Cluster i:
m_i = \frac{\sum_j u(j,i)^2 x_j}{\sum_j u(j,i)^2}
 end_for
o end_until
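A sketch of this loop in Python, using dist(j,i) = exp(-||xj - mi||^2) and the fuzzy-mean update above; the random initialization and the convergence tolerance are our own choices.

    import math
    import random

    # Fuzzy k-means sketch following the update rules above (illustrative).
    def fuzzy_kmeans(points, k, tol=1e-9, max_iter=100):
        means = [list(m) for m in random.sample(points, k)]   # initial guesses m_1..m_k
        n, p = len(points), len(points[0])
        for _ in range(max_iter):
            # dist(j, i) = exp(-||x_j - m_i||^2), then normalize as in the notes' example.
            dist = [[math.exp(-sum((x[t] - m[t]) ** 2 for t in range(p))) for m in means]
                    for x in points]
            col = [sum(dist[j][i] for j in range(n)) for i in range(k)]
            u = [[dist[j][i] / col[i] for i in range(k)] for j in range(n)]
            # Replace m_i with the fuzzy mean of all examples for cluster i.
            new_means = []
            for i in range(k):
                w = [u[j][i] ** 2 for j in range(n)]
                new_means.append([sum(w[j] * points[j][t] for j in range(n)) / sum(w)
                                  for t in range(p)])
            change = max(abs(a - b) for m, nm in zip(means, new_means)
                         for a, b in zip(m, nm))
            means = new_means
            if change < tol:                                   # "no changes in any mean"
                break
        return means, u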

Sequential K means
o Another way to modify the k-means procedure is to update the
means one example at a time, rather than all at once. This is
particularly attractive when we acquire the examples over a
period of time, and we want to start clustering before we have
seen all of the examples.
o The algorithm is similar to the regular k means algorithm.
o Sequential K Means Algorithms
 Make initial guesses for the means m1, m2, ..., mk
 Set the counts n1, n2, ..., nk to zero
 Until interrupted
 Acquire the next example, x
 If mi is closest to x
o Increment ni
o Replace mi by mi + (1/ni)*( x - mi)
 end_if
 end_until
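A direct sketch of this online update in Python; the stream argument can be any iterable (e.g. a generator over incoming examples), and the helper names are our own.

    # Sequential k-means sketch following the steps above (illustrative).
    def sequential_kmeans(initial_means, stream):
        """initial_means: initial guesses m_1..m_k; stream: iterable of examples."""
        means = [list(m) for m in initial_means]
        counts = [0] * len(means)                       # n_1..n_k start at zero
        for x in stream:                                # "until interrupted": one example at a time
            # Find the mean m_i closest to x (squared Euclidean distance).
            i = min(range(len(means)),
                    key=lambda c: sum((xt - mt) ** 2 for xt, mt in zip(x, means[c])))
            counts[i] += 1                              # increment n_i
            # Replace m_i by m_i + (1/n_i) * (x - m_i).
            means[i] = [mt + (xt - mt) / counts[i] for xt, mt in zip(x, means[i])]
        return means, counts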

K-medoids clustering method
o K-medoids: the most centrally located object in each cluster serves as its representative
o Arbitrarily choose k objects as the initial medoids
o Until no change, do
 (Re)assign each object to the cluster of its nearest medoid
 Randomly select a non-medoid object o', and compute the total cost, S, of swapping a medoid o with o'
 If S < 0 then swap o with o' to form the new set of k medoids
 S = \sum_{\text{clusters}} \sum_{p \in \text{cluster}} d(p, o)
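A sketch of this swap-based (PAM-style) procedure in Python; where the notes say "until no change", the sketch simply tries a fixed number of random swaps, and all names are our own.

    import random

    # K-medoids sketch following the swap procedure above (illustrative).
    def kmedoids(points, k, dist, n_trials=100):
        medoids = random.sample(range(len(points)), k)        # arbitrary initial medoids
        def total_cost(meds):
            # Sum over clusters of the distances from each point to its nearest medoid.
            return sum(min(dist(p, points[m]) for m in meds) for p in points)
        cost = total_cost(medoids)
        for _ in range(n_trials):
            # Randomly select a non-medoid o' and compute the cost change S of swapping.
            o_prime = random.choice([i for i in range(len(points)) if i not in medoids])
            o = random.randrange(k)                            # position of the medoid to swap out
            trial = medoids[:o] + [o_prime] + medoids[o + 1:]
            S = total_cost(trial) - cost                       # S < 0 means the swap improves things
            if S < 0:
                medoids, cost = trial, cost + S
        # Final assignment: each point goes to its nearest medoid.
        assign = [min(range(k), key=lambda c: dist(p, points[medoids[c]])) for p in points]
        return [points[m] for m in medoids], assign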

Principal Component Analysis
o Problem: many types of data have too many attributes to be visualized or manipulated conveniently.
o For example, a single microarray experiment may have 6,000-8,000 genes.
o PCA is a method for reducing the number of attributes (dimensions) of numerical data while attempting to preserve the cluster structure.
o After PCA, we hopefully get the same clusters as we would if we clustered the data before PCA.
o After PCA, plots of the data should still have the clusters falling into obvious groups.
o By using PCA to reduce the data to 2 or 3 dimensions, off-the-shelf geometry viewers can be used to visualize the data.
o Consider the data as an m-by-m matrix in which each cell is the covariance between attributes i and j. The eigenvectors corresponding to the d largest eigenvalues of this matrix are the "principal components". By projecting the data onto these vectors, one obtains d-dimensional points.
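A small NumPy sketch of this projection; variable names and the use of numpy are our own, and the commented example at the end is hypothetical.

    import numpy as np

    # PCA sketch: project the data onto the d eigenvectors of the covariance
    # matrix with the largest eigenvalues (illustrative).
    def pca(data, d):
        """data: (n_samples, n_attributes) array; returns (n_samples, d) projected points."""
        centered = data - data.mean(axis=0)          # centre each attribute
        cov = np.cov(centered, rowvar=False)         # attribute-by-attribute covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)       # eigh: the covariance matrix is symmetric
        order = np.argsort(eigvals)[::-1][:d]        # indices of the d largest eigenvalues
        components = eigvecs[:, order]               # the d "principal components"
        return centered @ components                 # d-dimensional points, e.g. for plotting

    # Hypothetical usage: reduce a genes-by-conditions expression matrix to 2-D.
    # reduced = pca(expression_matrix, d=2)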