Clustering and Network
Park, Jong Hwa
MRC-DUNN
Hills Road Cambridge
CB2 2XY
England
Bioinformatics in Biosophy
02/06/2001
What is clustering?
Clustering is a method by which a large set of data is grouped into clusters of smaller sets of similar data. In other words, clustering means grouping data, or dividing a large data set into smaller subsets whose members are similar to one another.
http://cne.gmu.edu/modules/dau/stat/clustgalgs/clust1_frm.html
http://www-cse.ucsd.edu/~rik/foa/l2h/foa-5-4-2.html
What is a clustering algorithm ?
A clustering algorithm attempts to find natural
groups of components (or data) based on some
similarity.
The clustering algorithm also finds the centroid of
a group of data sets.
To determine cluster membership, most
algorithms evaluate the distance between a point
and the cluster centroids. The output from a
clustering algorithm is basically a statistical
description of the cluster centroids with the
number of components in each cluster.
Error function is a function that indicates quality
of clustering
Definition:
The centroid of a cluster is a point whose parameter
values are the mean of the parameter values of all the
points in the clusters.
What is the common metric for clustering techniques?
Generally, the distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean metric, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as:
d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... )
For sequence comparison, the distances can be genetic distances (such as PAM). For clustering expression profiles, the Euclidean distance can be used. Distances are defined according to the problem.
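As a quick illustration (not part of the original slides), the Euclidean distance can be computed directly with NumPy; the vectors p and q below are made-up three-feature expression profiles:

import numpy as np

p = np.array([1.0, 2.0, 3.0])   # hypothetical expression profile
q = np.array([2.0, 0.0, 4.0])

d = np.sqrt(np.sum((p - q) ** 2))            # the formula above
assert np.isclose(d, np.linalg.norm(p - q))  # same result via numpy's norm
print(d)                                     # sqrt(1 + 4 + 1) = sqrt(6)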
Kinds of clustering algorithms
Non-hierarchical clustering methods
  Single-pass methods
  Reallocation methods
  K-means clustering
Hierarchical clustering methods
  Group average link method (UPGMA)
  Single link method
    MST algorithms
  Complete link method
    Voorhees algorithm
  Ward's method (minimum variance method)
  Centroid and median methods
General algorithm for HACM
Hierarchical Clustering
Dendrograms are used for representation.
• The general strategy is to represent the similarity matrix as a graph, form a separate cluster around each node, and traverse the edges in decreasing order of similarity, merging two clusters according to some criterion.
• Merging criteria:
• Single-link: merge maximally connected components.
• Minimum spanning tree (MST) based approach: merge the clusters connected by the MST edge with the smallest weight (see the sketch after this slide).
• Complete-link: merge to get a maximally complete component.
Partitional: a single partition is found.
Hierarchical: a sequence of nested partitions is found, by merging two partitions at every step.
• Agglomerative: glue smaller clusters together.
• Divisive: fragment a larger cluster into smaller ones.
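The MST-based single-link criterion can be sketched in a few lines of Python (an illustrative addition, not from the slides; the function name is mine, the SciPy routines are standard). Cutting the k-1 heaviest MST edges leaves k single-link clusters:

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_single_link(points, k):
    # Pairwise Euclidean distances as a dense matrix.
    d = squareform(pdist(points))
    # Minimum spanning tree of the complete distance graph.
    mst = minimum_spanning_tree(csr_matrix(d)).toarray()
    edges = mst[mst > 0]
    # Removing the k-1 heaviest MST edges yields k single-link clusters.
    threshold = np.sort(edges)[-(k - 1)] if k > 1 else np.inf
    mst[mst >= threshold] = 0
    n_comp, labels = connected_components(csr_matrix(mst), directed=False)
    return labels

(Assumes distinct points, so that no pairwise distance is exactly zero.)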
Partitional Clustering
Find a single partition of k clusters based on some clustering criterion.
• Clustering criteria:
• Local: forms clusters by utilizing local structure in the data (e.g. nearest-neighbor clustering).
• Global: represents each cluster by a prototype and assigns each pattern to the cluster with the most similar prototype (e.g. K-means, self-organizing maps).
• Many other techniques exist in the literature, such as density estimation and mixture decomposition.
• From [Jain & Dubes] Algorithms for Clustering Data, 1988
Nearest Neighbor Clustering
Input:
• A threshold, t, on the nearest-neighbor distance.
• A set of data points {x1, x2, ..., xn}.
Algorithm:
• [Initialize: assign x1 to cluster C1. Set i = 1, k = 1.]
• Set i = i + 1. Find the nearest neighbor of xi among the points already assigned to clusters.
• Let the nearest neighbor be in cluster m. If its distance is greater than t, increment k and assign xi to a new cluster Ck; otherwise assign xi to Cm.
• If every data point has been assigned to a cluster, stop; otherwise return to the second step.
• From [Jain & Dubes] Algorithms for Clustering Data, 1988
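A direct Python translation of this algorithm (an illustrative sketch, not from the slides; the point set and threshold below are arbitrary):

import numpy as np

def nn_cluster(points, t):
    # Threshold-based nearest-neighbor clustering, as described above.
    labels = np.empty(len(points), dtype=int)
    labels[0] = 0                    # x1 starts cluster C1
    k = 0
    for i in range(1, len(points)):
        # Nearest neighbor of xi among the already-assigned points.
        dists = np.linalg.norm(points[:i] - points[i], axis=1)
        j = int(np.argmin(dists))
        if dists[j] > t:
            k += 1                   # distance exceeds t: open a new cluster
            labels[i] = k
        else:
            labels[i] = labels[j]    # otherwise join the neighbor's cluster
    return labels

labels = nn_cluster(np.random.rand(20, 2), t=0.3)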
Iterative Partitional Clustering
Input:
• K, the number of clusters; a set of data points {x1, x2, ..., xn};
• a measure of distance between them (e.g. Euclidean, Mahalanobis); and a clustering criterion (e.g. minimize squared error).
Algorithm:
• [Initialize: a random partition with K cluster centers.]
• Generate a new partition by assigning each data point to its closest cluster center.
• Compute new cluster centers as the centroids of the clusters.
• Repeat the above two steps until the criterion reaches an optimum value.
• Finally, adjust the number of clusters by merging/splitting existing clusters, or by removing small (outlier) clusters.
• From [Jain & Dubes] Algorithms for Clustering Data, 1988
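The same loop in Python, as a minimal batch sketch under the stated assumptions (Euclidean distance, squared-error criterion, random initial centers; the final merge/split adjustment is omitted):

import numpy as np

def kmeans(points, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), K, replace=False)]
    for _ in range(iters):
        # Assign each data point to its closest cluster center.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the centroid of its cluster.
        new = np.array([points[labels == j].mean(axis=0)
                        if np.any(labels == j) else centers[j]
                        for j in range(K)])
        if np.allclose(new, centers):    # partition stopped changing
            break
        centers = new
    return labels, centers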
AVERAGE LINKAGE CLUSTERING:
The dissimilarity between clusters is calculated using average values. Unfortunately, there are many ways of calculating an average! The most common (and recommended if there is no reason for using other methods) is UPGMA, the Unweighted Pair-Group Method with Arithmetic mean.
The average distance is calculated from the distances between each point in one cluster and all points in another cluster. The two clusters with the lowest average distance are joined together to form a new cluster.
(Sneath, P.H.A. and Sokal, R.R. (1973) Numerical Taxonomy (pp. 230-234), W.H. Freeman and Company, San Francisco, California, USA)
The GCG program PILEUP uses UPGMA to create its dendrogram of DNA sequences, and then uses this dendrogram to guide its multiple alignment algorithm. The GCG program DISTANCES calculates pairwise distances between a group of sequences.
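In practice UPGMA is available off the shelf; for example, SciPy's hierarchical clustering implements it as method='average' (an illustrative sketch with made-up data, not tied to PILEUP or DISTANCES):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

profiles = np.random.rand(10, 20)      # hypothetical: 10 items, 20 features
d = pdist(profiles)                    # condensed pairwise Euclidean distances
tree = linkage(d, method='average')    # 'average' = UPGMA
labels = fcluster(tree, t=3, criterion='maxclust')  # cut the tree into 3 clusters
# scipy.cluster.hierarchy.dendrogram(tree) would draw the guide tree.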
COMPLETE LINKAGE CLUSTERING
(Maximum or Furthest-Neighbour Method):
The dissimilarity between two groups is equal to the greatest dissimilarity between a member of cluster i and a member of cluster j.
This method tends to produce very tight clusters of similar cases.
SINGLE LINKAGE CLUSTERING
(Minimum or Nearest-Neighbour Method):
The dissimilarity between two clusters is the minimum dissimilarity between members of the two clusters.
This method produces long chains, which form loose, straggly clusters. It has been widely used in numerical taxonomy.
WITHIN GROUPS CLUSTERING
This is similar to UPGMA, except that clusters are fused so that the within-cluster variance is minimised. This tends to produce tighter clusters than the UPGMA method.
(UPGMA: Unweighted Pair-Group Method with Arithmetic mean)
Ward’s method
Cluster membership is assessed by
calculating the total sum of squared
deviations from the mean of a cluster.
The criterion for fusion is that it should
produce the smallest possible increase in
the error sum of squares.
Lance, G. N. and Williams, W. T. 1967. A general theory of classificatory
sorting strategies. Computer Journal, 9: 373-380.
K-Means Clustering Algorithm
This nonhierarchical method initially takes a number of components of the population equal to the final required number of clusters. In this step, the final required number of clusters is chosen such that these points are mutually farthest apart.
Next, it examines each component in the population and assigns it to one of the clusters depending on the minimum distance. The centroid's position is recalculated every time a component is added to the cluster, and this continues until all the components are grouped into the final required number of clusters.
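The incremental recalculation can be done in constant time per point (an illustrative note, not from the slides): after adding x to a cluster whose n previous members have centroid c, the new centroid is c + (x - c)/(n + 1).

import numpy as np

def add_to_cluster(centroid, n, x):
    # Equivalent to (n * centroid + x) / (n + 1).
    return centroid + (x - centroid) / (n + 1), n + 1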
Complexity of the K-means Algorithm
• Time complexity = O(RKN), where R is the number of iterations
• Space complexity = O(N)
K-Medians Algorithm
The K-medians algorithm is similar to the K-means algorithm, except that it uses a median instead of a mean.
• Time complexity = O(RN^2), where R is the number of iterations
• Space complexity = O(N)
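The only change relative to K-means is the center update step; a minimal sketch (illustrative, not from the slides):

import numpy as np

def update_center(members):
    # Coordinate-wise median of the cluster's member points; unlike the
    # mean, this is meaningful in discrete spaces and robust to outliers.
    return np.median(members, axis=0)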
K-Means vs. K-Medians (1)
• The K-means algorithm requires a continuous space, so that a mean is a potential element of the space.
• The K-medians algorithm also works in discrete spaces, where a mean has no meaning.
• K-means requires less computational time, because it is easier to compute a mean than to compute a median.
Problems with K-means Clustering
• Achieving a globally minimum error is NP-complete.
• Very sensitive to the initial points.
• When used with large databases, the time complexity can easily become intractable.
• Existing algorithms are not generic enough to detect various shapes of clusters (spherical, non-spherical, etc.).
Genetic Clustering Algorithm
• Genetic clustering algorithms can achieve a "better" clustering result than K-means.
• Refining the initial points can achieve a "better" local minimum and reduce convergence time.
A Genetic Clustering Algorithm
• "Clustering using a coarse-grained parallel Genetic Algorithm: A Preliminary Study", N. K. Ratha, A. K. Jain, and M. J. Chung, IEEE, 1995.
• Uses a genetic algorithm to solve a K-means clustering problem formulated as an optimization problem.
• We can also view it as a label assignment problem, such that the assignment of {1, 2, ..., K} to each pattern minimizes the similarity function.
Definition of Genetic Algorithm
• A search based on the "survival of the fittest" principle [R. Bianchini et al., 1993].
• The "fittest candidate" is the best solution at any given time.
• Run the evolution process for a sufficiently large number of generations.
Simple Genetic Algorithm
function GENETIC-ALGO(population, FITNESS-FN) returns an individual
  inputs: population, a set of individuals (fixed in number)
          FITNESS-FN, a function that measures the fitness of an individual
  repeat
    parents = SELECTION(population, FITNESS-FN)
    population = REPRODUCTION(parents)
  until some individual is fit enough
  return the best individual in population, according to FITNESS-FN
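A minimal runnable version of this loop in Python (an illustrative sketch: it assumes individuals are equal-length lists, an even-sized population, positive fitness values, fitness-proportional selection, and one-point crossover, and it runs a fixed number of generations instead of testing "fit enough"; none of these choices are fixed by the slide):

import random

def genetic_algo(population, fitness_fn, generations=100):
    for _ in range(generations):
        # SELECTION: fitness-proportional sampling of parents.
        weights = [fitness_fn(ind) for ind in population]
        parents = random.choices(population, weights=weights, k=len(population))
        # REPRODUCTION: one-point crossover between consecutive parents.
        population = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = random.randrange(1, len(a))
            population += [a[:cut] + b[cut:], b[:cut] + a[cut:]]
    # Return the best individual found, according to fitness_fn.
    return max(population, key=fitness_fn)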
Pros and Cons
Pros
• Clustering results are better compared to the K-means algorithm.
Cons
• The search space grows exponentially as a function of the problem size.
• Parallel computing helps, but not much.
Need for better clustering algorithms
Enormity of data
• Hierarchical clusterings soon become impractical.
High dimensionality
• Distance-based algorithms become ill-defined because of the curse of dimensionality.
• The notion of a neighborhood no longer corresponds to physical proximity.
• All the data are far from the mean!
Handling noise
• The similarity measure becomes noisy as the hierarchical algorithm groups more and more points, so clusters that should not have been merged may get merged!
Handling high dimensionality
• Reduce the dimensionality and apply traditional techniques.
• Dimensionality reduction:
• Principal Component Analysis (PCA), Latent Semantic Indexing (LSI):
Use Singular Value Decomposition (SVD) to determine the most influential features (maximum eigenvalues). Given data in an n x m matrix format (n data points, m attributes), PCA computes the SVD of the covariance matrix of the attributes, whereas LSI computes the SVD of the original data matrix. LSI is faster and more memory-efficient, and has been successful in the information retrieval domain (clustering documents).
• Multidimensional Scaling (MDS):
Preserves the original rank ordering of the distances among data points.
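A compact PCA-via-SVD sketch in NumPy (illustrative; X is an n x m data matrix as above, and the function name is mine):

import numpy as np

def pca_reduce(X, d):
    # Center each attribute, then take the SVD of the centered data;
    # the rows of Vt are the principal directions, ordered by variance.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:d].T               # n x d reduced representation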
Clustering in High Dimensional Data Sets
DNA, protein, and interaction data are high-dimensional.
• Traditional distance-based approach
• Hypergraph-based approach
Hypergraph-Based Clustering
• Construct a hypergraph in which related
data are connected via hyperedges.
• How do we find related sets of data items?
Use Association Rules!
• Partition this hypergraph in a way such that
each partition contains highly connected
data.
graph
Definition: A set of items connected by edges. Each item is called a vertex or node. Formally, a graph is a set of
vertices and a relation between vertices, adjacency.
See also directed graph, undirected graph, acyclic graph, biconnected graph, connected graph, complete graph,
sparse graph, dense graph, hypergraph, multigraph, labeled graph, weighted graph, self-loop, isomorphic,
homomorphic, graph drawing, diameter, degree, dual, adjacency-list representation, adjacency-matrix
representation.
Note: Graphs are so general that many other data structures, such as trees, are just special kinds of graphs.
Graphs are usually represented G = (V,E), where V is the set of vertices, and E is the set of edges. If the graph is
undirected, the adjacency relation is symmetric. If the graph does not allow self-loops, adjacency is irreflexive.
A graph is like a road map. Cities are vertices. Roads from city to city are edges. (How about junctions or
branches in a road? You could consider junctions to be vertices, too. If you don't want to count them as vertices,
a road may connect more than two cities. So strictly speaking you have hyperedges in a hypergraph. It all
depends on how you want to define it.)
Another way to think of a graph is as a bunch of dots connected by lines. Because mathematicians stopped
talking to regular people long ago, the dots in a graph are called vertices, and the lines that connect the dots are
called edges. The important things are edges and the vertices: the dots and the connections between them. The
actual position of a given dot or the length or straightness of a given line isn't at issue. Thus the dots can be
anywhere, and the lines that join them are infinitely stretchy. Moreover, a mathematical graph is not a
comparison chart, nor a diagram with an x- and y-axis, nor a squiggly line on a stock report. A graph is simply
dots and lines between them---pardon me, vertices and edges.
Michael Bolton <[email protected]> 22 February 2000
Graph
• Formally, a graph is a pair (V, E) where V is any set, called the vertex set, and the edge set E is any subset of the set of all 2-element subsets of V. Usually the elements of V, the vertices, are illustrated by bold points or small circles, and the edges by lines between them.
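For example, a small undirected graph G = (V, E) stored as an adjacency list in Python (an illustrative addition, not from the slides):

V = ['A', 'B', 'C']
E = [('A', 'B'), ('B', 'C')]

adj = {v: set() for v in V}
for u, w in E:
    adj[u].add(w)
    adj[w].add(u)   # undirected: the adjacency relation is symmetric

print(adj)          # {'A': {'B'}, 'B': {'A', 'C'}, 'C': {'B'}}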
hypergraph
• Definition: A graph whose hyperedges connect two or more vertices.
• See also multigraph, undirected graph.
• Note: Consider "family", a relation connecting two or more people. If each person is a vertex, a family edge connects the father, mother, and all of their children. So G = (people, family) is a hypergraph. Contrast this with the binary relations "married to", which connects a man and a woman, or "child of", which is directed from a child to his or her father or mother.
General Approach for High Dimensional Data Sets
• Data → sparse hypergraph (via association rules) → partitioning-based clustering
• Data → sparse graph (via a similarity measure) → agglomerative clustering
References
[1] Usama Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, 1996.
[2] Michael Berry and Gordon Linoff, Data Mining Techniques (For Marketing, Sales, and Customer Support), John Wiley & Sons, 1997.
[3] Sholom M. Weiss and Nitin Indurkhya, Predictive Data Mining (A Practical Guide), Morgan Kaufmann Publishers, 1998.
[4] Alex Freitas and Simon Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998.
[5] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[6] V. Cherkassky and F. Mulier, Learning from Data, John Wiley & Sons, 1998.
[7] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis, Introduction to Parallel Computing: Algorithm Design and Analysis, Benjamin Cummings/Addison Wesley, Redwood City, 1994.
Research paper references:
[1] J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[2] M. Mehta, R. Agrawal, and J. Rissanen, SLIQ: A Fast Scalable Classifier for Data Mining, Proc. of the Fifth Int. Conf. on Extending Database Technology (EDBT), Avignon, France, 1996.
[3] J. Shafer, R. Agrawal, and M. Mehta, SPRINT: A Scalable Parallel Classifier for Data Mining, Proc. 22nd Int. Conf. on Very Large Databases, Mumbai, India, 1996.
Gene expression and genetic network analysis
A gene's expression level is the number of copies of that gene's RNA produced in a cell, and correlates with the amount of the corresponding protein produced.
DNA microarrays greatly improve the scalability and accuracy of gene expression level monitoring: they can simultaneously monitor thousands of gene expression levels.
http://www.ib3.gmu.edu/gref/S01/csi739/overview.pdf
Goals of Gene Expression Analysis
What genes are or are not expressed?
Correlate expression with other parameters:
– developmental state
– cell types
– external conditions
– disease states
Outcomes of analysis:
– functions of unknown genes
– identification of co-regulated groups
– identification of gene regulators and inhibitors
– environmental impact on gene expression
– diagnostic gene expression patterns
Methods for Gene Expression Analysis
Early processing:
– image analysis
– statistical analysis of redundant array elements
– output raw or normalized expression levels
– store results in a database
Clustering:
– visualization
– unsupervised methods
– supervised methods
Modeling:
– reverse engineering
– genetic network inference
Unsupervised Clustering Methods
Direct visual inspection
– Carr et al (1997) Stat Comp Graph News 8(1)
– Michaels et al (1998) PSB 3:42-53
Hierarchical clustering
– DeRisi et al (1996) Nature Genetics 14:457-460
Average linkage
– Eisen et al (1998) PNAS 95:14863-14868
– Alizadeh et al (2000) Nature 403:503-511
k-means
– Tavazoie et al (1999) Nature Genetics 22:281-285
SOMs
– Toronen et al (1999) FEBS Letters 451:142-146
– Tamayo et al (1999) PNAS 96:2907-2912
Relevance networks
– Butte et al (2000) PSB 5:415-426
SVD/PCA
– Alter et al (2000) PNAS 97(18):10101-10106
Two-way clustering
– Getz et al (2000) PNAS 97(22):12079-12084
– Alon et al (1999) PNAS 96:6745-6750
Supervised Learning
Goal: classification of
– genes
– disease states
– developmental states
– effects of environmental signals
Linear discriminants
Decision trees
Support vector machines
– Brown et al (2000) PNAS 97(1):262-267
Gene regulation network models
– Somogyi and Sniegoski (1996) Complexity 1(6)
Boolean models
– Kauffman
Weight matrices
– Weaver et al (1999) PSB 4
Petri nets
– Matsuno et al (2000) PSB 5
Differential equation models
– Chen et al (1999) PSB 4
Gene Network Inference Methods
Reverse engineering
– Liang et al (1998) PSB 3:18-29
– Akutsu et al (1999) PSB 4:17-28
Perturbation methods
– Ideker (2000) PSB 5:302-313
Determinations
– Kim et al (2000) Genomics 67:201-209
Recent Applications
Gene function assignment
– Brown et al (2000) PNAS 97(1):262-267
– Alon et al (1999) PNAS 96:6745-6750
Cell cycle
– DeRisi et al (1997) Science 278:680-686
– Toronen et al (1999) FEBS Letters 451:142-146
– Alter et al (2000) PNAS 97(18):10101-10106
Cell response to external conditions
– Alter et al (2000) PNAS 97(18):10101-10106
Cancer therapeutics
– Butte et al (2000) PNAS 97(22):12182-12186
– Getz et al (2000) PNAS 97(22):12079-12084
– Tamayo et al (1999) PNAS 96:2907-2912
Cancer diagnosis
– DeRisi et al (1996) Nature Genetics 14:457-460
– Alon et al (1999) PNAS 96:6745-6750
Microarray Analysis Software
Michael Eisen's Lab (http://rana.lbl.gov)
Data analysis
– Cluster: performs a variety of types of cluster analysis and other types of processing on large microarray datasets. Currently includes hierarchical clustering, self-organizing maps (SOMs), k-means clustering, and principal component analysis. (Eisen et al. (1998) PNAS 95:14863)
– TreeView: graphically browse the results of clustering and other analyses from Cluster. Supports tree-based and image-based browsing of hierarchical trees, with multiple output formats for generating images for publications.
http://www.ib3.gmu.edu/gref/S01/csi739/schedule.html
Informatics
– image analysis
– gene expression raw data
– database issues
– data volumes
– sources of errors
Boolean Network
(Binary Network)
Boolean Genetic Network Modeling
Goals
• Understand the global characteristics of genetic regulation networks.
Topics
Boolean network models
– terminology
– dynamics
Inference of models from gene expression data
– cluster analysis
– mutual information
Extensions to the model
Patterns of Gene Regulation
Genes typically interact with more than one partner.
Wiring Diagrams
Three genes: A, B, C
• A activates B
• B activates A and C
• C inhibits A
There are many ways to represent interaction rules:
• Boolean (logical) functions
• sigmoid functions
• semi-linear models
• etc.
http://www.ib3.gmu.edu/gref/S01/csi739/gene_networks.pdf
• The dynamics of Boolean networks of any complexity are determined by the wiring and the rules, or state-transition tables.
• Time is discrete, and all genes are updated simultaneously; a simulation of the three-gene example above is sketched below.
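A minimal simulation of the three-gene wiring diagram (illustrative; the slides give the wiring but not how activation and inhibition combine, so the rule "A turns on iff B is on and C is off" is an assumed choice):

def step(a, b, c):
    return (int(b and not c),  # A: activated by B, inhibited by C (assumed AND/NOT rule)
            a,                 # B: activated by A
            b)                 # C: activated by B

state = (1, 0, 0)
for t in range(6):
    print(t, state)
    state = step(*state)       # synchronous update: all genes change at once

Starting from (1, 0, 0), the network settles into the period-2 cycle (0, 1, 0) <-> (1, 0, 1), a small instance of the periodicity noted in the summary below.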
Data Requirements
Data sources
– time series
– different environmental conditions
A fully connected Boolean model with N genes requires 2^N observations.
A Boolean model with at most k inputs per gene requires O(2^k log N) observations [Akutsu, PSB 1999]
– e.g., 1000 genes, 3 inputs => about 80 data points (arrays), since 2^3 × log2(1000) ≈ 8 × 10 = 80
Reverse Engineering
Given: a (large) set of gene expression observations
Find:
– a wiring diagram
– transition rules
such that the network fits the observed data.
Example methods:
– cluster analysis
– mutual information
Information can be quantified by the Shannon entropy (H), which can be calculated from the probabilities of occurrence of individual or combined events:
H(X) = - Σ_i p_i log2 p_i
The Shannon entropy is maximal when all states are equiprobable.
• Mutual information (M): the information (Shannon entropy) shared by non-independent elements:
M(X, Y) = H(X) + H(Y) - H(X, Y) = H(Y) - H(Y|X)
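Both quantities are easy to estimate from discretized expression states; a small sketch using the formulas above (illustrative, not from the slides):

import numpy as np
from collections import Counter

def entropy(xs):
    # H(X) in bits, estimated from observed state frequencies.
    n = len(xs)
    return -sum(c / n * np.log2(c / n) for c in Counter(xs).values())

def mutual_information(xs, ys):
    # M(X, Y) = H(X) + H(Y) - H(X, Y).
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))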
Summary
Gene regulation involves distributed function,
redundancy and combinatorial coding
Boolean networks provide a promising initial
framework for understanding gene regulation
networks
Boolean Net and Reverse Engineering
Boolean networks exhibit:
– global complex behavior
– self-organization
– stability
– redundancy
– periodicity
Reverse engineering
– tries to infer wiring diagrams and transition functions from observed gene expression patterns
More realistic network models include:
– continuous expression levels
– continuous time
– continuous transition functions
– many more biologically important variables