A Survey on Consensus Clustering Techniques
Anup K. Chalamalla
School of Computer Science, University of Waterloo
Email: [email protected]
Abstract—Consensus clustering is an important elaboration of
the classical clustering problem, in which multiple clusterings
of a dataset are consolidated into a single clustering. The
different clusterings are obtained from different runs of the same
algorithm or different algorithms. Formally, given r clusterings
of a dataset, λ1 , . . . , λr , the objective is to produce a single
clustering λ̂ that agrees as much as possible with the r clusterings.
In this survey, we describe and classify various approaches
to the problem of consensus clustering. We discuss different
formulations of the problem, consensus functions and efficient
algorithms to compute them, and specific applications addressed
in the literature.
I. INTRODUCTION
Clustering, an important task in data analysis with applications in data mining, image analysis, bioinformatics and
pattern recognition, is the assignment of a set of objects into
groups (called clusters) so that objects in the same cluster are
similar, while objects in different clusters are dissimilar. This
task assumes that there is some well-defined distance measure,
which will determine how the similarity of two objects is
calculated. There is also a quality measure that captures
the intra-cluster similarity and inter-cluster dissimilarity. The
primary goal of a clustering algorithm is to optimize this
quality measure. There are many approaches to improving the quality of clustering; consensus clustering is one of the most important.
Consensus clustering combines multiple clusterings of a dataset into a single clustering that is, in some sense, better than the input clusterings. Consensus clustering is known under many different names, such as cluster ensembles, cluster aggregation, and clustering combination, in different areas
of research: machine learning [1], [2], pattern recognition [3],
bioinformatics [5], and data mining [6]. We next discuss the
motivation and application areas of consensus clustering.
A. Motivation and Applications
Several different types of clustering techniques exist in the
literature such as iterative refinement approaches, e.g., SOM
and K-Means [4], Hierarchical Clustering [7], Subspace Clustering [8], etc., which have been effective to some extent in
several applications. However, each of them has shortcomings, such as high time complexity for large numbers of dimensions and data objects, fuzziness in the distance measure, the number of clusters not being known a priori, sensitivity to initial settings (e.g., K-Means), getting stuck in local optima, and a lack of robust techniques to validate the clustering results. Consensus clustering tries to address
many of these shortcomings by using a consensus function
to combine the multiple clusterings. We discuss some of the
major application areas here.
Improve Quality and Robustness. Iterative refinement
algorithms such as K-Means and the EM algorithm are sensitive to the choice of the initial seed clusters. Hence, running K-Means with different seeds may yield very different clusterings of the same data objects. It has been observed that multiple weak clusterings can be combined into a stronger one by computing a consensus among the resulting clusterings of multiple runs of an algorithm, e.g., K-Means, seeded with different initial centers [9], [10]. Similarly, clusterings generated by different algorithms, such as density-based, K-Means, fuzzy c-means, and graph-partitioning-based methods, can be aggregated to
obtain gains in clustering quality.
Distributed and Privacy-Preserving Clustering. Many applications nowadays must process massive amounts of data, which is therefore often distributed, e.g., a large customer database partitioned vertically and stored in different geographic locations (column-distributed). Different clusterings of the same
data are generated on different sets of attributes. There is
a need to combine them to obtain a clustering that agrees
with all the different clusterings. Consensus clustering can
also be employed in privacy-preserving scenarios where the distributed parties can share only certain amounts of higher-level information, such as cluster labels, or a limited
number of observed features of each object. For example,
in gene function prediction, separate gene clusterings can
be obtained from diverse sources such as gene sequence
comparisons, and combinations of DNA microarray data from
many independent experiments. Each clustering hence shares
only specific aspects of the data and the goal is to integrate
them to obtain a unified clustering.
Identifying the correct number of clusters. Automatic
identification of the appropriate number of clusters is an important research problem [11], [12]. Previous approaches impose a hard
constraint on the quality or the distance measure in order to
determine the number of clusters. For example, in agglomerative algorithms one can impose a bound on the distance
beyond which no pair of clusters will be merged. Some of the
approaches we discuss provide ways to automatically select
the number of clusters. The various clusterings input to the
consensus function can have different numbers of clusters, and the consensus function itself can determine a different number of clusters based on how strongly the input clusterings agree on which objects are similar. For example, if many input
clusterings place two objects in the same cluster, then a good
consensus function will not split these two objects.
Handling Missing Information. Data comes in many different forms: it may include categorical attributes, attributes with incomparable values, constantly changing values, or missing attribute values. There are also legacy clusterings, often provided by human experts, in which cluster labels are available for old data but not for new data. These situations can lead to missing or incorrect cluster
labels for objects in certain clusterings. Consensus clustering
provides a framework to account for the missing labels and
missing values in data objects.
B. Challenges
We summarize the key challenges raised by the problem of
consensus clustering as follows:
• To explore the space of possible consensus clusterings efficiently in order to determine the best consensus clustering.
• To model the similarities among the input clusterings and design effective consensus functions accordingly.
The remainder of this survey is organized as follows. In Section 2, we classify and discuss various formulations of consensus clustering approaches, along with the consensus functions and methods to compute them. In Section 3 we compare the different approaches, analyze their complexity, and discuss their strengths and drawbacks. Section 4 discusses the generation of input clusterings, Section 5 discusses open problems and potential application areas, and we conclude in Section 6.
II. APPROACHES TO CONSENSUS CLUSTERING
In this section we discuss the classification of various
formulations of consensus clustering and consensus functions.
First, we adopt a notation that more or less captures the ideas of all the formulations. Let $\chi = \{x_1, x_2, \ldots, x_n\}$ denote a set of $n$ objects. A partitioning of these $n$ objects into $k$ clusters can be represented as a set of $k$ sets of objects $\{C_j \mid j = 1, \ldots, k\}$ or as a label vector $\lambda_q \in \mathbb{N}^n$. For each $x_i$, we use $C_q(x_i)$ to denote the label of the cluster to which object $x_i$ belongs, i.e., $C_q(x_i) = j$ if and only if $x_i \in C_{qj}$. A clusterer $\Phi_q$ is a clustering algorithm that generates the label vector $\lambda_q$ given $\chi$. Let $k_q$ be the number of clusters in $\lambda_q$. A set of $r$ labelings $\Lambda = \{\lambda_q \mid q \in \{1, \ldots, r\}\}$ is combined into a single labeling $\hat{\lambda}$ using a consensus function $\Gamma$. The general architecture of consensus clustering is shown in Fig. 1.

[Fig. 1. The general architecture of consensus clustering: clusterers $\Phi_1, \ldots, \Phi_r$ produce labelings $\lambda_1, \ldots, \lambda_r$, which a consensus function $\Gamma$ combines into a single labeling $\hat{\lambda}$.]

Consider, for example, four input clusterings: $\lambda_1 = (1, 1, 1, 2, 2, 3, 3)$, $\lambda_2 = (2, 2, 2, 3, 3, 1, 1)$, $\lambda_3 = (1, 1, 2, 2, 3, 3, 3)$, and $\lambda_4 = (1, 2, ?, 1, 2, ?, ?)$. An
inspection suggests that a reasonable consensus clustering is $(2, 2, 2, 3, 3, 1, 1)$. Here, $\lambda_4$ has missing labels. Each clustering follows a different labeling scheme; e.g., $\lambda_1$ and $\lambda_2$ describe the same partition with different labels. For a labeling with $k$ distinct clusters there are $k!$ equivalent representations as integer label vectors. Hence, a common assumption is that a labeling scheme follows two rules: (i) $C_1 = 1$; (ii) $\forall i = 1, \ldots, n-1: C_{i+1} \le \max_{j=1,\ldots,i}(C_j) + 1$, where $C_i$ denotes the label of $x_i$. Hence, a labeling $\lambda_q$ can be transformed into an equivalent labeling using a uniform scheme for all clusterings.
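As an illustration, the following is a minimal sketch of such a canonicalization (a hypothetical helper, not taken from any of the surveyed papers), assuming label vectors are given as plain Python lists:

```python
def canonicalize(labels):
    """Relabel a clustering in order of first occurrence: the first object
    gets label 1, and every newly seen cluster gets the smallest unused label,
    satisfying the two rules described above."""
    mapping = {}
    canonical = []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping) + 1  # next unused label
        canonical.append(mapping[lab])
    return canonical

# These two labelings describe the same partition and map to the same vector.
print(canonicalize([2, 2, 2, 3, 3, 1, 1]))  # [1, 1, 1, 2, 2, 3, 3]
print(canonicalize([1, 1, 1, 2, 2, 3, 3]))  # [1, 1, 1, 2, 2, 3, 3]
```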
A. Graph-Partitioning Approaches
Problem of Graph-Partitioning. Given a weighted graph
G, the goal is to partition it into k disjoint clusters of vertices. Unless a given graph has k or more connected components, any k-way partition will cut some of the graph edges. The sum of the weights of these cut edges is defined as the cut of a partition P:

$$\mathrm{Cut}(P, W) = \sum_{(i,j):\, P(i) \neq P(j)} W(i, j),$$

where the sum runs over pairs of vertices $i$ and $j$ that do not belong to the same cluster. The goal of a graph partitioning algorithm is to minimize the cut of the k-way partition.
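For concreteness, a minimal sketch of the cut computation, assuming the graph is given as a symmetric NumPy weight matrix and the partition as a label vector (each undirected edge is counted once):

```python
import numpy as np

def cut(W, labels):
    """Sum of weights of edges whose endpoints lie in different clusters.
    W is a symmetric n x n weight matrix; labels[i] is the cluster of vertex i."""
    labels = np.asarray(labels)
    n = len(labels)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):            # each undirected edge counted once
            if labels[i] != labels[j]:
                total += W[i, j]
    return total
```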
Two types of graph-partitioning techniques for consensus
clustering are discussed in the literature: 1) instance-based
graph partitioning (IBGF) in which a similarity metric is
induced based on the number of clusterings that cluster pairs
of objects together, and 2) cluster-correspondence-based graph
partitioning (CBGF) in which a measure is induced based on
the similarity between clusters in two different clusterings. We
discuss the approaches which adopt one of these techniques
and a hybrid approach in this section.
1) Objective Functions: The consensus functions discussed later try to optimize an objective function. The objective functions capture the similarity between the input clusterings at the instance level or the cluster level.
Mutual Information. The mutual information metric is
proposed with the cluster ensemble framework by Strehl et
al. in [1]. Mutual information, a measure of the statistical information shared between two distributions, is used to quantify the similarity between two clusterings [13].
Let $X$ and $Y$ be the random variables described by the cluster labelings $\lambda_a$ and $\lambda_b$, with $k_a$ and $k_b$ clusters respectively. Let $I(X, Y)$ denote the mutual information between $X$ and $Y$, and $H(X)$ denote the entropy of $X$. Thus the normalized mutual information is $NMI(X, Y) = \frac{I(X, Y)}{\sqrt{H(X) H(Y)}}$. Let $n_h^{(a)}$ be the number of objects in cluster $C_h$ according to $\lambda_a$, and $n_l^{(b)}$ be the number of objects in cluster $C_l$ according to $\lambda_b$. Let $n_{h,l}$ denote the number of objects that are in $C_h$ according to $\lambda_a$ as well as in $C_l$ according to $\lambda_b$. Then, after substituting for $I$ and $H$, the normalized mutual information estimate $\Phi^{(NMI)}$ is given by:

$$\Phi^{(NMI)}(\lambda_a, \lambda_b) = \frac{\sum_{h=1}^{k_a} \sum_{l=1}^{k_b} n_{h,l} \log\left(\frac{n \cdot n_{h,l}}{n_h^{(a)} n_l^{(b)}}\right)}{\sqrt{\left(\sum_{h=1}^{k_a} n_h^{(a)} \log\frac{n_h^{(a)}}{n}\right)\left(\sum_{l=1}^{k_b} n_l^{(b)} \log\frac{n_l^{(b)}}{n}\right)}} \quad (1)$$
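As an illustration, a direct sketch of Eq. (1), assuming two integer label vectors over the same $n$ objects; this is not the authors' implementation, and it does not guard against degenerate labelings with a single cluster:

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two labelings, as in Eq. (1)."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    n = len(a)
    num = 0.0
    for h in np.unique(a):
        for l in np.unique(b):
            n_hl = np.sum((a == h) & (b == l))
            if n_hl > 0:
                n_h, n_l = np.sum(a == h), np.sum(b == l)
                num += n_hl * np.log(n * n_hl / (n_h * n_l))
    # both denominator sums are non-positive, so their product is non-negative
    den_a = sum(np.sum(a == h) * np.log(np.sum(a == h) / n) for h in np.unique(a))
    den_b = sum(np.sum(b == l) * np.log(np.sum(b == l) / n) for l in np.unique(b))
    return num / np.sqrt(den_a * den_b)
```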
Based on this pairwise measure of mutual information, the optimal combined clustering $\lambda^{(k\text{-}opt)}$ is the one that has maximal mutual information with all individual labelings $\lambda_q$ in $\Lambda$, given $k$ to be the number of clusters in the consensus clustering. In other words:

$$\lambda^{(k\text{-}opt)} = \arg\max_{\hat{\lambda}} \sum_{q=1}^{r} \Phi^{(NMI)}(\hat{\lambda}, \lambda_q) \quad (2)$$

where $\hat{\lambda}$ ranges over all possible $k$-partitions. The authors show that the above optimization problem is hard, and a brute-force solution is infeasible. Even greedy approaches have their own drawbacks, such as a strong dependency on initial settings and convergence to poor local optima. The authors propose three techniques based on graph partitioning, CSPA, HGPA, and MCLA, which we discuss later.
Disagreement Metric. The disagreement metric is defined
in the cluster aggregation framework in [14]. As in the cluster
ensemble framework, cluster aggregation defines a distance
measure between two clusterings. Let $d_{x_i, x_j}(\lambda_1, \lambda_2)$ denote a boolean function whose value is 0 if $\lambda_1$ and $\lambda_2$ both put $x_i$ and $x_j$ in the same cluster (or both put them in different clusters), and 1 otherwise. Basically, this function measures the disagreement between two clusterings on a pair of data objects. The distance between two clusterings is defined as in Eq. (3). The problem of clustering aggregation is then, given a set of clusterings $\Lambda$, to compute a new clustering $\hat{\lambda}$ that minimizes the total number of disagreements with $\Lambda$, given by the sum $\sum_{q=1}^{r} d_\chi(\lambda_q, \hat{\lambda})$. The distance function (Equation 3) satisfies a number of properties, such as the triangle inequality.

$$d_\chi(\lambda_1, \lambda_2) = \sum_{(x_i, x_j) \in \chi \times \chi} d_{x_i, x_j}(\lambda_1, \lambda_2) \quad (3)$$
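A minimal sketch of this disagreement distance follows; it counts each unordered pair of objects once (summing over ordered pairs, as Eq. (3) does, simply doubles the value):

```python
from itertools import combinations

def disagreement(labels_1, labels_2):
    """Number of object pairs on which the two clusterings disagree."""
    d = 0
    for i, j in combinations(range(len(labels_1)), 2):
        same_1 = labels_1[i] == labels_1[j]
        same_2 = labels_2[i] == labels_2[j]
        if same_1 != same_2:          # one clusters the pair together, the other does not
            d += 1
    return d

def total_disagreement(consensus, clusterings):
    """Objective minimized by clustering aggregation."""
    return sum(disagreement(consensus, lam) for lam in clusterings)
```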
2) Graph Partitioning Techniques: Cluster-based Similarity Partitioning Algorithm (CSPA). An n × n boolean
similarity matrix on the n objects is built for each clustering
λq, where an entry of 1 indicates that two objects are in the same cluster and 0 otherwise. A cumulative similarity matrix S
is obtained from r such boolean matrices where each entry is
the fraction of clusterings in which two objects are clustered
together. A similarity-graph is induced from this matrix whose
edge-weights correspond to the entries in S, and METIS [15]
is used to partition the graph and obtain a consensus clustering
of the objects.
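A sketch of the CSPA construction follows, under the assumption that a spectral partitioner from scikit-learn stands in for METIS (the original algorithm uses METIS):

```python
import numpy as np
from sklearn.cluster import SpectralClustering  # stand-in for METIS

def cspa(clusterings, k):
    """clusterings: list of r label vectors over the same n objects.
    Returns a consensus labeling with k clusters."""
    n = len(clusterings[0])
    S = np.zeros((n, n))
    for lam in clusterings:
        lam = np.asarray(lam)
        S += (lam[:, None] == lam[None, :]).astype(float)  # 1 if co-clustered
    S /= len(clusterings)   # fraction of clusterings that co-cluster each pair
    model = SpectralClustering(n_clusters=k, affinity="precomputed")
    return model.fit_predict(S)
```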
Hypergraph Partitioning Algorithm (HGPA). The set of input clusterings Λ is transformed into a hypergraph, in which the vertices are the objects to be clustered and a hyperedge connects a set of objects belonging to the same cluster. The problem of consensus clustering is then reduced to finding a minimal cut of the hypergraph. Standard hypergraph partitioning algorithms (e.g., HMETIS), combined with NMI as an objective function to control the partition sizes, are used to obtain the consensus clustering.
Meta-Clustering Algorithm (MCLA). In this approach,
several hyperedges, each representing a cluster, are grouped
together and collapsed into a single hyperedge. If the total number of hyperedges in the hypergraph is $\sum_{j=1}^{r} k_j$, then $k$ collapsed hyperedges are generated, using NMI as the similarity measure between clusters to decide which hyperedges to combine.
Hybrid Bipartite Graph Formulation (HBGF). HBGF, proposed by Fern et al. [16], takes a hybrid approach by combining the ideas of instance-based graph partitioning (IBGF) and cluster-based graph partitioning (CBGF). In this approach, the authors formulate the cluster ensemble problem as partitioning a weighted bipartite graph, where the two sets of vertices in the bipartite graph correspond to 1) $V^C$, the set of all clusters in all input clusterings, and 2) $V^I$, the set of $n$ data objects. If the vertices $i$ and $j$ are both clusters or both objects, $W(i, j) = 0$; if object $i$ belongs to cluster $j$, then $W(i, j) = W(j, i) = 1$, and $W(i, j) = 0$ otherwise.
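A sketch of this bipartite connectivity matrix W is given below; the partitioning step itself (e.g., with spectral methods or METIS) is omitted:

```python
import numpy as np

def hbgf_graph(clusterings):
    """Build the HBGF bipartite connectivity matrix.
    Rows/columns are ordered as: all clusters of all input clusterings, then the
    n data objects.  W[u, v] = 1 iff one of u, v is an object and the other is a
    cluster containing it; cluster-cluster and object-object entries are 0."""
    n = len(clusterings[0])
    clusters = []                                  # enumerate (clustering, label) pairs
    for q, lam in enumerate(clusterings):
        for c in sorted(set(lam)):
            clusters.append((q, c))
    m = len(clusters)
    W = np.zeros((m + n, m + n))
    for col, (q, c) in enumerate(clusters):
        for i in range(n):
            if clusterings[q][i] == c:
                W[m + i, col] = W[col, m + i] = 1.0
    return W
```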
To illustrate the effectiveness of a hybrid approach, consider
two pairs of instances (A, B) and (C, D). Assume that A and B are never clustered together in the ensemble, and that the same is true for the pair (C, D). However, the instances A and B are each
frequently clustered together with the same group of instances
in the ensemble, i.e., A and B are frequently assigned to two
different clusters that are similar to each other. In contrast, this
is not true for C and D. Intuitively we consider A and B to be
more similar to one another than C and D. However, IBGF will
fail to differentiate these two cases and assign both similarities
to be zero. This is because IBGF ignores the information
about the similarity of clusters while computing the similarity
of instances. Similarly, CBGF has its own drawbacks. The
hybrid approach integrates the similarity between instances
and similarity between clusters simultaneously. It uses two
graph partitioning techniques, spectral graph partitioning [17]
and METIS to compute the consensus clustering.
There are other graph partitioning approaches and objective
functions that are slight variations of the ones discussed above.
We leave the details of those approaches to possibly an
extended version of this survey.
B. Probabilistic Approaches
In this section we discuss the probabilistic approaches to
consensus clustering. As opposed to the graph-partitioning
approaches, in probabilistic approaches the objective functions
are tightly coupled with the consensus functions that optimize
them.
1) Bayesian Cluster Ensembles: The Bayesian Cluster Ensembles (BCE) model proposed in [18] takes a Bayesian approach to
consensus clustering. It treats all the input clustering results
for each object as a feature vector with discrete feature
values, and learns a mixed-membership model from such a
feature representation. Figures 2(a) and 2(b) show B, the
matrix representation of cluster assignments of objects by
different input clusterings. The distance-based approaches
process the clusterings column-wise (Fig. 2(a)), whereas BCE
processes them row-wise (Fig. 2(b)). The consensus clustering
problem then becomes finding a clustering λ̂ of the objects
{x1 , . . . , xn } with feature vectors as rows of B. BCE is
defined as a mixture model which generates the matrix B.
[Fig. 2. Matrix representation B of the input cluster assignments: (a) the column-wise view used by distance-based approaches, (b) the row-wise view used by BCE.]

Assuming that there are $K$ consensus clusters, each object $x_i$'s cluster ids are drawn from a finite mixture model $\theta_i$ over the $K$ clusters, and $\theta_i$ is sampled from a Dirichlet distribution with parameter $\alpha$. Further, each consensus cluster id $h$ is associated with a discrete distribution $\beta_{hj}$, one for each input clustering $\lambda_j$, over its cluster ids $\{1, \ldots, k_j\}$. Hence, if an object $x_i$ belongs to consensus cluster $h$ for $\lambda_j$, its cluster id $x_{ij} = s \in [1, k_j]$ is determined by the distribution $\beta_{hj}(s) = p(x_{ij} \mid h)$, where $\beta_{hj}(s) \ge 0$ and $\sum_{s=1}^{k_j} \beta_{hj}(s) = 1$. Let $z_{ij}$ be the latent variable denoting that object $x_i$ belongs to consensus cluster $h$ for $\lambda_j$. Hence, given the model parameters $\alpha$ and $\beta = \{\beta_{hj}\}$, $h \in [1, K]$, $j \in [1, r]$, the joint probability distribution over the latent variables $\theta_i$, $z_i$ and the observed values $\{x_{ij}\}$, $i \in [1, n]$, $j \in [1, r]$, is given by:

$$p(x_i, \theta_i, z_i \mid \alpha, \beta) = p(\theta_i \mid \alpha) \prod_{j=1,\, \exists x_{ij}}^{r} p(z_{ij} = h \mid \theta_i)\, p(x_{ij} \mid \beta_{hj}), \quad (4)$$
where $\exists x_{ij}$ denotes that the $j$th input clustering provides a result for $x_i$ (there may be no label for $x_i$ in some clusterings). Given the observable matrix B, the goal is to
estimate the mixed membership $\theta_i$, $i \in [1, n]$, of each object to the consensus clusters. The model parameters $\alpha$ and $\beta$ are unknown, hence they have to be estimated such that the likelihood of observing B is maximized. Typically, the EM algorithm would be used, alternating between calculating the posterior over the latent variables, $p(\theta_i, z_i \mid x_i, \alpha, \beta)$, and updating the parameters until convergence. However, computing the posterior in closed form turns out to be intractable. Hence, the authors employ two known techniques, variational inference and Gibbs sampling, to compute the posterior distribution.
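To make the generative model of Eq. (4) concrete, the following sketch samples a clustering matrix B from it; the parameter values are illustrative, the helper name is hypothetical, all base clusterings are assumed to share the same number of labels, and inference (variational EM or Gibbs sampling) is not shown:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bce(n, r, K, k_j, alpha):
    """Sample B (n objects x r base clusterings) from the BCE generative model:
    theta_i ~ Dirichlet(alpha); for each base clustering j, draw a consensus
    cluster h ~ theta_i, then an observed cluster id x_ij ~ beta_{hj}."""
    # beta[h][j] is a categorical distribution over the k_j labels of clustering j
    beta = [[rng.dirichlet(np.ones(k_j)) for _ in range(r)] for _ in range(K)]
    theta = rng.dirichlet(alpha, size=n)             # mixed memberships
    B = np.zeros((n, r), dtype=int)
    for i in range(n):
        for j in range(r):
            h = rng.choice(K, p=theta[i])            # latent consensus cluster z_ij
            B[i, j] = rng.choice(k_j, p=beta[h][j])  # observed base-cluster label
    return B, theta

B, theta = sample_bce(n=100, r=5, K=3, k_j=4, alpha=np.ones(3))
```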
As with the variational inference techniques discussed in class [19], the posterior distribution is approximated by a family of variational distributions to compute a lower bound $L$ on the log-likelihood $\log p(x_i \mid \alpha, \beta)$. The variational distributions are obtained by introducing new variational parameters $\phi, \gamma$ and choosing an approximating distribution for the $x_i$'s. A variational EM algorithm is used to maximize the lower bound. The algorithm starts with some initialization of the parameters $\alpha, \beta$; the E-step finds the variational parameters that maximize $L$, and the M-step uses the computed variational parameters to maximize $L$ over $\alpha, \beta$ to find new estimates for them. These two steps repeat until convergence is reached. The paper also
proposes specialized EM-algorithms for row-distributed and
column-distributed cluster ensembles, for which we refer the
readers to the original paper. The paper also proposes Gibbs sampling as an approach to compute the posterior distribution, assuming a Dirichlet prior over $\beta$.
2) Mixture Model for Consensus Clustering: As in BCE, the mixture model for consensus clustering proposed in [20] views the cluster labels of an object according to the different input clusterings as a set of new features associated with the object. Let $x_{ij} = \lambda_j(x_i)$ be the cluster label assigned by the $j$th clustering to data object $x_i$; then $x_i$ follows a finite parametric mixture model (Eq. 7) with components corresponding to the $k$ consensus clusters. The data $\{x_i\}$ are generated by first drawing a component according to the probability mass function $\alpha_m$, and then sampling a point from the distribution $p_m(x \mid \theta_m)$. Given the data $\mathbf{x} = \{x_i\}_{i=1}^{n}$, in which each variable $x_i$ is assumed to be independent and identically distributed, the log-likelihood function over the parameters $\Theta = \{\alpha_1, \ldots, \alpha_k, \theta_1, \ldots, \theta_k\}$ is given as follows. The goal is to find the parameters that maximize the likelihood function.
$$p(x_i \mid \Theta) = \sum_{m=1}^{k} \alpha_m\, p_m(x_i \mid \theta_m) \quad (5)$$

$$\log L(\Theta \mid \mathbf{x}) = \log \prod_{i=1}^{n} p(x_i \mid \Theta) \quad (6)$$

$$= \sum_{i=1}^{n} \log \sum_{m=1}^{k} \alpha_m\, p_m(x_i \mid \theta_m) \quad (7)$$
As in BCE, the maximum-likelihood problem cannot be solved in closed form when all the parameters are unknown. Hence, the EM algorithm is applied to the equation for $p(\mathbf{x} \mid \Theta)$, after making a conditional-independence assumption that simplifies $p_m(x_i \mid \theta_m)$ to $\prod_{j=1}^{r} p_m^{(j)}(x_{ij} \mid \theta_m^{(j)})$, where each $p_m^{(j)}(x \mid \theta_m^{(j)})$ is a multinomial over the $k_j$ labels of $\lambda_j$. With each $x_i$, a hidden variable $z_i = \{z_{i1}, \ldots, z_{ik}\}$ is introduced, such that $z_{im} = 1$ if $x_i$ belongs to the $m$th component and $z_{im} = 0$ otherwise. The EM algorithm starts with an initial guess for the parameters in $\Theta$. The E-step computes the expected values of the hidden variables, $E[z_{im}]$, and the M-step maximizes the likelihood by computing new estimates of the parameters. The convergence criterion is based on the improvement in the likelihood between two M-steps. The consensus clustering solution is obtained from the expected values $E[z_{im}]$: once convergence is achieved, an object $x_i$ is assigned to the component with the largest value of $E[z_{im}]$. A minimal sketch of this EM procedure is given below; we then discuss the non-parametric Bayesian cluster ensemble approach.
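The sketch below is a simplified stand-in for this EM procedure over a mixture of products of multinomials, not the authors' implementation; labels are assumed complete and coded 0..k_j-1:

```python
import numpy as np

def mixture_consensus(B, K, n_iter=100, eps=1e-9, seed=0):
    """EM for a mixture of products of multinomials over the label matrix B
    (n objects x r base clusterings).  Returns argmax_m E[z_im] per object."""
    rng = np.random.default_rng(seed)
    B = np.asarray(B)
    n, r = B.shape
    k = [B[:, j].max() + 1 for j in range(r)]
    alpha = np.full(K, 1.0 / K)                                       # mixing weights
    theta = [rng.dirichlet(np.ones(k[j]), size=K) for j in range(r)]  # theta[j]: (K, k_j)
    for _ in range(n_iter):
        # E-step: responsibilities E[z_im] ~ alpha_m * prod_j theta_jm(x_ij)
        log_resp = np.log(alpha)[None, :] + np.zeros((n, K))
        for j in range(r):
            log_resp += np.log(theta[j][:, B[:, j]] + eps).T          # (n, K)
        log_resp -= log_resp.max(axis=1, keepdims=True)
        resp = np.exp(log_resp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights and multinomial parameters
        alpha = resp.mean(axis=0)
        for j in range(r):
            counts = np.zeros((K, k[j]))
            for label in range(k[j]):
                counts[:, label] = resp[B[:, j] == label].sum(axis=0)
            theta[j] = (counts + eps) / (counts + eps).sum(axis=1, keepdims=True)
    return resp.argmax(axis=1)
```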
3) Non-parametric Bayesian Cluster Ensembles: The non-parametric Bayesian cluster ensemble (NBCE) approach [21] is similar in spirit to BCE, except that it uses a Dirichlet process mixture model to generate the data. We again have the clustering matrix B as in BCE, where the row vector $x_i = \{x_{ij} \mid j \in [1, r]\}$ is a new feature vector representation of the $i$th data object. The $x_i$'s are generated using a Dirichlet process mixture model with concentration parameter $\alpha_0$ and base measure $G_0$, using the truncated stick-breaking (TSB) construction.
The TSB construction stops at level $K$. Let an infinite sequence of random variables be defined as $v_k \sim \mathrm{Beta}(1, \alpha_0)$, and let $\pi = \{\pi_k \mid k = 1, 2, \ldots, \infty\}$, where $\pi_k = v_k \prod_{j=1}^{k-1} (1 - v_j)$, be the mixing proportions of the infinite number of components. The TSB truncates after iterating $K$ times by setting $v_K = 1$, which automatically makes $\pi_k = 0$ for $k > K$.
Let the probability of generating cluster id $x_{ij} = k_j$ by $\lambda_j$ for $x_i$ be $\theta_{ijk_j}$, where $\sum_{k_j=1}^{K_j} \theta_{ijk_j} = 1$. Let $x_i = \{x_{ij} = k_j \mid j \in [1, r]\}$, $\theta_{ij} = \{\theta_{ijk_j} \mid k_j \in [1, K_j]\}$, and $\theta_i = \{\theta_{ij} \mid j \in [1, r]\}$. Then $x_i$ is generated with probability $\prod_{j=1}^{r} \theta_{ijk_j}$. Since the truncation level is $K$, there are $K$ distinct $\theta_i$, denoted $\theta_k^*$, $k \in \{1, \ldots, K\}$, and each $\theta_k^*$ is sampled from $G_0$. Hence, in addition to $\pi_k$, an indicator variable $z_i$ is associated with each object $x_i$ to indicate which $\theta_k^*$ is assigned to $x_i$. A consensus cluster is defined as a cluster of objects associated with the same $\theta_k^*$. Further, the algorithm assumes a Dirichlet prior $\pi \sim \mathrm{Dir}(\frac{\alpha_0}{K}, \ldots, \frac{\alpha_0}{K})$.
The goal is to compute the components of the distribution $P(X, Z, \pi, \theta^* \mid \alpha_0, G_0)$, where $X = \{x_i \mid i \in [1, n]\}$ and $Z = \{z_i \mid i \in [1, n]\}$. The approach discussed in the paper is to apply Gibbs sampling after marginalizing $\pi$ and $\theta^*$. The paper also proposes variational inference techniques similar to those of BCE.
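A small sketch of the truncated stick-breaking construction of the mixing proportions $\pi$ follows, with illustrative parameter values:

```python
import numpy as np

def truncated_stick_breaking(alpha0, K, rng=None):
    """Truncated stick-breaking: v_k ~ Beta(1, alpha0) for k < K, v_K = 1,
    pi_k = v_k * prod_{j<k} (1 - v_j); components beyond K get zero mass."""
    rng = rng or np.random.default_rng()
    v = rng.beta(1.0, alpha0, size=K)
    v[-1] = 1.0                        # truncation at level K
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining               # mixing proportions; sums to 1

pi = truncated_stick_breaking(alpha0=1.0, K=10)
print(pi, pi.sum())
```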
C. Relabeling and Voting Approaches
The voting approach is the third kind of known approach to the consensus clustering problem. It first solves the label correspondence problem that we discussed at the beginning of this paper. This approach assumes that all the input clusterings have the same number of clusters, and so does the target consensus clustering. The idea is to choose a reference clustering among the given input clusterings and, for each other clustering, to permute the labels of objects so as to obtain the best agreement with the reference clustering. For a clustering with $k$ labels, there are $k!$ equivalent labelings. The Hungarian algorithm can be employed to achieve an $O(k^3)$ solution to the cluster relabeling problem. After solving the cluster relabeling problem, a voting algorithm can be employed to determine the consensus cluster id of each object [3], [22].
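A sketch of relabeling plus voting is given below, using SciPy's Hungarian solver (linear_sum_assignment) to align each clustering to the first one before a per-object majority vote; it assumes every clustering uses labels 0..k-1 with the same k:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm, O(k^3)

def relabel_and_vote(clusterings):
    """Align every clustering to the first one, then majority-vote per object."""
    ref = np.asarray(clusterings[0])
    k = ref.max() + 1
    aligned = [ref]
    for lam in clusterings[1:]:
        lam = np.asarray(lam)
        # overlap[a, b] = number of objects labeled a in lam and b in ref
        overlap = np.zeros((k, k))
        for a in range(k):
            for b in range(k):
                overlap[a, b] = np.sum((lam == a) & (ref == b))
        rows, cols = linear_sum_assignment(-overlap)   # maximize total overlap
        mapping = dict(zip(rows, cols))
        aligned.append(np.array([mapping[l] for l in lam]))
    aligned = np.vstack(aligned)                       # r x n matrix of aligned labels
    return np.array([np.bincount(col, minlength=k).argmax() for col in aligned.T])
```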
III. COMPARISON OF CONSENSUS CLUSTERING TECHNIQUES
In this section we compare the various techniques in terms of their computational complexity, their performance, and the accuracy of the consensus clusterings they generate.
A. Complexity Analysis
The complexity of the graph-partitioning techniques discussed by Strehl et al. [1] (Section 2.1) depends on the
complexity of the partitioners used, such as (H)METIS. The
worst-case complexity of CSPA is $O(nK^2r)$, that of HGPA is $O(nKr)$, and that of MCLA is $O(nK^2r^2)$. The complexity of the HBGF partitioning technique is $O(nK)$. The
earlier three graph-partitioning techniques are either based
on instance-based graph partitioning (IBGF) or cluster-based
graph partitioning ideas (CBGF). HBGF leverages both the
ideas and hence achieves a better running time than the other approaches.
As for the complexity of the probabilistic approaches, the methods used are either variational Bayesian inference techniques or sampling techniques. Variational inference techniques approximate otherwise intractable integrals in Bayesian inference, and their efficiency is well known in the literature. The complexity of the voting approaches is $O(k^3)$.
B. Accuracy Analysis
1) Graph Partitioning Techniques: For comparing CSPA,
HGPA, and MCLA, a random number generator is used to
generate r noisy labelings for a dataset and the labelings
are fed to each of the techniques. The resulting consensus labelings are evaluated by comparing their NMI with all the input labelings ($\phi^{(ANMI)}(\Lambda, \hat{\lambda})$) and with all possible cluster labelings of the dataset. It is observed that as the noise increases, the NMI measure for $\hat{\lambda}$ decreases, and HGPA performs the worst among the three algorithms. All three algorithms proposed here fall into either the IBGF or the CBGF category.
HBGF avoids the pitfalls of both IBGF and CBGF, by
considering the similarity of instances and the similarity of clusters simultaneously. The HBGF evaluation uses NMI against the true cluster labels to compare algorithms based on the HBGF, CBGF, and IBGF formulations. The cluster ensembles are generated by random
subsampling from the datasets, and then clustering the sample
and assigning the objects not in the sample to one of the
clusters based on Euclidean distance to cluster centers. The
maximum NMI value is compared for the three algorithms
over 5 datasets. It is observed that HBGF performs comparably
or significantly better than IBGF and CBGF for all of the
datasets.
2) Probabilistic Approaches: BCE is evaluated over 10
datasets from the UCI machine learning repository. Micro-precision is used as a measure of the accuracy of a consensus clustering with respect to the true labels. Micro-precision (MP) is defined as $MP = \sum_{h=1}^{K} \frac{a_h}{n}$, where $K$ is the number of clusters, $n$ is the number of objects, and $a_h$ denotes the number of objects in consensus cluster $h$ that are correctly assigned to the corresponding class. The corresponding class for consensus cluster $h$ is the true class with the largest overlap with the cluster. MP satisfies $0 \le MP \le 1$, with 1 indicating the best possible consensus clustering.
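A direct sketch of MP, assuming the consensus clustering and the true classes are given as non-negative integer label vectors:

```python
import numpy as np

def micro_precision(consensus, truth):
    """MP = (1/n) * sum_h a_h, where a_h is the size of the largest overlap of
    consensus cluster h with any single true class."""
    consensus, truth = np.asarray(consensus), np.asarray(truth)
    n = len(consensus)
    mp = 0
    for h in np.unique(consensus):
        members = truth[consensus == h]
        mp += np.bincount(members).max()   # a_h: objects matching the best true class
    return mp / n
```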
The input clusterings are generated by running K-Means 2000 times over a dataset of $n$ objects. They are further divided into 100 subsets, each with 20 input clusterings, thus generating 100 $n \times 20$ base clustering matrices. The maximum and average MPs are computed from the BCE results. It is observed that the
result clustering generated by BCE always outperforms the
input clusterings in maximum and average MP. Further, BCE also outperforms CSPA and the mixture model in 80% of the
results.
For the mixture model, the results are compared against the CSPA, HGPA, and MCLA graph-partitioning algorithms over
five datasets. The mean error rate of consensus clustering is
used as the measure to compare the algorithms. It is observed
that the mixture model performs better than CSPA and HGPA for most of the input clusterings, but MCLA performs better as the number of input clusterings increases.
NBCE is evaluated using F-1 and perplexity measures on
test datasets whose true cluster labels are known. NBCE has a better F-1 measure than CSPA, HGPA, and MCLA, and a better perplexity measure than BCE, which in turn is better than the mixture model.
C. Strengths and Drawbacks
We summarize the strengths and drawbacks of different
kinds of approaches to consensus clustering in Table I.
IV. GENERATING INPUT CLUSTERINGS
An important problem related to consensus clustering is generating diverse input clusterings for empirical studies. One approach used in the literature is to generate multiple clusterings using the K-Means algorithm with different initializations. Other approaches include random sub-sampling and random projection [18].
V. DISCUSSION AND OPEN PROBLEMS
Consensus clustering is an active area of research and there
are a number of open problems that are still to be addressed.
We discuss some of the open problems in this section.
Fluctuations in Input Clusterings. We have seen that the
accuracy of many of the algorithms we discussed goes down
as the noise in the input clusterings increases. There may be
a few input clusterings which adversely affect the accuracy of
the consensus clustering. For example, 80% of the input clusterings may agree on a particular consensus clustering while the remaining 20% may lead to a vastly different clustering. Also, the accuracy of the consensus clustering depends greatly on the number of input clusterings, which can be very small or very large. It is very difficult to automatically identify the input
clusterings that adversely affect the final clustering. However,
it may be worth investigating the possibility of designing
consensus functions which can minimize the effect of such
clusterings. Further, a framework could be developed in which a number of consensus clusterings ranked by some scoring function are output, letting the user choose among them. It is also worth investigating a hierarchical approach to consensus clustering: for example, instead of using the entire set of input clusterings at once, one could use subsets of the input clusterings to produce different consensus clusterings and then compute a consensus over these. Further, as the Dirichlet process approach proves more accurate than the parametric mixture models, it is worth investigating the application of Pitman-Yor processes to consensus clustering.
In the approaches discussed in this survey, it is assumed
that all the input clusterings are equally important. It is
possible that certain clusterings are more important than other
clusterings. It is worth investigating how the bias of certain
clusterings can be modeled.
Applications in Databases and Data Mining. Consensus
clustering has many applications in databases and web mining.
Problems in bioinformatics such as clustering gene expression
data have been discussed in the literature [23]. An important
application of consensus clustering is outlier detection. Though
traditional clustering techniques can be used for outlier detection, the quality and robustness of outlier detection improves
with consensus clustering. Multiple runs of the same or different
algorithms can be used to generate multiple clusterings of the
data, from which consensus about an object can be formed to
determine if it is an outlier.
In web mining, clustering the search results is an important
task. Web search engines such as Google allow users to view documents similar to those retrieved as search results. Consensus clustering can be used to improve the accuracy of finding similar documents.
Determining the number of clusters remains a difficult problem. Though the non-parametric Bayesian cluster ensemble approach [21] provides a way to automatically determine the number of consensus clusters, it appears to depend on the truncation level up to which the stick-breaking construction is applied. It is worth investigating how traditional
techniques [11], [12] can be combined with consensus functions to automatically determine the number of consensus
clusters.
VI. CONCLUSION AND FUTURE WORK
Consensus clustering is an important elaboration of the
classical clustering problem and has emerged as an important
approach to improving the quality of clustering results. Several
approaches have been proposed independently to address the
problem of consensus clustering. The idea is to use a consensus function to compute a clustering that is a better fit than
the input clusterings.
In this survey we discussed and classified major approaches
to consensus clustering. We discussed the motivation and
application areas of consensus clustering. Three major kinds of
approaches discussed in the literature are graph-partitioning,
probabilistic approaches and voting approaches. We provided
formulations for different kinds of approaches and discussed
the consensus functions. We provided an analysis of the
complexity and accuracy of various approaches. We further
discussed and compared the strengths and drawbacks of various approaches. The probabilistic approaches are by far the
most useful since they address most variations of the consensus
clustering problem, such as handling missing values and row-distributed and column-distributed clustering. We intend to
publish an extended version of this survey discussing the
algorithms, strengths and drawbacks and application areas for
each of the approaches in greater detail.
REFERENCES
[1] A. Strehl and J. Ghosh: Cluster Ensembles — A Knowledge Reuse Framework for Combining Multiple Partitions, Journal of Machine Learning Research (JMLR) 3:583-617 (2002).
TABLE I
COMPARISON OF CONSENSUS CLUSTERING TECHNIQUES

Graph Partitioning Techniques [1] [16]
Strengths:
• Uses an objective function to control the partition size.
• Handles missing cluster labels in the input clusterings.
• Handles column-distributed cluster ensembles automatically.
• [1] uses a technique called the supra-consensus function to select the best consensus function for the data.
• HBGF further improves the accuracy by combining the similarity between instances as well as between clusters.
• Since the techniques use scalable algorithms such as METIS and spectral graph partitioning, they scale with the number of input clusterings and the number of clusters per clustering.
• The computational complexity is reasonably good, and the clusters obtained are stable and robust.
Drawbacks:
• There is no automatic way of detecting the number of consensus clusters; K is determined manually.
• It is observed that, for CSPA and HGPA, the accuracy of the result clustering goes down as the noise in the labelings increases.
• Not a very effective approach for row-distributed cluster ensembles.

Probabilistic Approaches [20] [18] [21]
Strengths:
• In most cases, the accuracy of these approaches is better than that of the graph partitioning algorithms.
• They can handle missing values in the clusterings.
• They can handle both row-distributed and column-distributed cluster ensembles; BCE [18] also proposes row-distributed and column-distributed EM algorithms.
• The number of consensus clusters is automatically determined from the observations.
• The techniques are scalable.
Drawbacks:
• The accuracy of the mixture models is not better than the graph partitioning algorithms when the number of input clusterings is large.
• The mixture models suffer from overfitting; this can be overcome using Bayesian approaches.
• Sampling requires time, as sometimes a large number of input clusterings is needed to obtain reasonable accuracy.

Voting Approaches [22] [3]
Strengths:
• They solve the label correspondence problem, which many of the other approaches sidestep with simplifying assumptions instead of actually solving it.
Drawbacks:
• High computational cost.
• They perform poorly in the presence of touching clusters (clusters without clear boundaries).
[2] X. Z. Fern, Carla E. Brodley: Random Projection for High Dimensional
Data Clustering: A Cluster Ensemble Approach, ICML 2003:186-193.
[3] Ana L. N. Fred, Anil K. Jain: Data Clustering Using Evidence Accumulation, ICPR 2002:276-280.
[4] J. A. Hartigan: Clustering Algorithms, John Wiley, 1975.
[5] Vladimir Filkov, Steven Skiena: Integrating Microarray Data by Consensus Clustering, ICTAI 2003:418-425.
[6] Alexander P. Topchy, Martin H. C. Law, Anil K. Jain, Ana L. N. Fred: Analysis of Consensus Partition in Cluster Ensemble, ICDM 2004:225-232.
[7] Trevor Hastie, Robert Tibshirani, Jerome Friedman: "14.3.12 Hierarchical clustering", The Elements of Statistical Learning, 2nd ed., New York: Springer, pp. 520-528, 2009.
[8] Karin Kailing, Hans-Peter Kriegel, Peer Kröger: Density-Connected Subspace Clustering for High-Dimensional Data, SDM 2004:246-257.
[9] Paul S. Bradley, Usama M. Fayyad: Refining Initial Points for K-Means
Clustering, ICML 1998:91-99.
[10] Alexander P. Topchy, Anil K. Jain, William F. Punch: Combining Multiple Weak Clusterings, ICDM 2003:331-338.
[11] P. Smyth: Model selection for probabilistic clustering using cross-validated likelihood, Statistics and Computing 10(1):63-72.
[12] Hamerly, G. and Elkan, C.: Learning the k in k-means, NIPS 2003.
[13] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory,
Wiley, 1991.
[14] Aristides Gionis, Heikki Mannila, Panayiotis Tsaparas: Clustering Aggregation, ICDE 2005:341-352.
[15] George Karypis, Vipin Kumar: Multilevel k-way Partitioning Scheme
for Irregular Graphs, J. Parallel Distrib. Comput. (JPDC) 48(1):96-129
(1998).
[16] Xiaoli Zhang Fern, Carla E. Brodley: Solving cluster ensemble problems
by bipartite graph partitioning, ICML 2004.
[17] Andrew Y. Ng, Michael I. Jordan, Yair Weiss: On Spectral Clustering:
Analysis and an algorithm, NIPS 2001:849-856.
[18] Hongjun Wang, Hanhuai Shan, Arindam Banerjee: Bayesian Cluster
Ensembles, SDM 2009:209-220.
[19] Tommi Jaakkola, Michael I. Jordan: Variational Probabilistic Inference and the QMR-DT Network, J. Artif. Intell. Res. (JAIR) 10:291-322 (1999).
[20] Alexander P. Topchy, Anil K. Jain, William F. Punch: A Mixture Model
for Clustering Ensembles, SDM 2004.
[21] Pu Wang, Carlotta Domeniconi, Kathryn Blackmond Laskey: Nonparametric Bayesian Clustering Ensembles, ECML/PKDD 2010:435-450.
[22] Sandrine Dudoit, Jane Fridlyand: Bagging to Improve the Accuracy of
A Clustering Procedure, Bioinformatics 19(9):1090-1099 (2003).
[23] Stefano Monti, Pablo Tamayo, Jill Mesirov, Todd Golub: A resampling-based method for class discovery and visualization of gene expression microarray data, Machine Learning 52(1-2):91-118 (2003).