Imperial Journal of Interdisciplinary Research (IJIR)
Vol-2, Issue-12, 2016
ISSN: 2454-1362, http://www.onlinejournal.in
A Mutual Subspace Clustering Algorithm
for High Dimensional Datasets
K. Venkata Narayana¹ & Dr. A. Mary Sowjanya²
¹ M. Tech., ² Assistant Professor, Dept. of Computer Science and Systems Engineering,
A.U. College of Engineering, Andhra University, Visakhapatnam, Andhra Pradesh, India
Abstract: Generation of consistent clusters is always an interesting research issue in the field of knowledge and data engineering. In real applications, different similarity measures and different clustering techniques may be adopted in different clustering spaces. In such cases, it is very difficult or even impossible to define an appropriate similarity measure and clustering criterion in the union space. Mutual subspace clustering from multiple clustering spaces is critically different from subspace clustering in one (union) clustering space. Mutual subspace clustering finds the common clusters agreed upon by subspace clustering in both clustering spaces, which cannot be handled by traditional subspace clustering analysis. The partitioning model divides the points in a data set into k exclusive clusters and finds a signature subspace for each cluster, where k is the number of clusters desired by the user. This model improves k-means by eliminating random centroid selection, using the average pairwise distance and other parameters to generate consistent clusters. Experimental results on a cancer data set demonstrate the efficiency of mutual subspace clustering.
Key words: Subspace clustering, high dimensional data, average pairwise distance, k-means, mutual subspace, signature subspace
1. Introduction
With the evolution of technology, both the amount of data and the dimensionality of data are increasing tremendously. Data mining techniques have to be applied to this huge amount of available data in order to obtain the desired results. However, traditional algorithms do not scale to high dimensional data, since they lead to inconsistent results.
In recent years, the study of clustering algorithms is no longer just a matter of improving a single clustering algorithm. In 2005, Tung-Shou Chen et al. [1] proposed the H-K (Hierarchical K-means) clustering algorithm, which combines a hierarchical clustering method and a partition clustering method for data clustering. Compared with a single algorithm, H-K clustering can solve the problem of randomness and of the a priori selection of initial centers in the k-means clustering process, and obtain better clustering results. However, it still has high computational complexity.
As H-K clustering is used more and more widely in practical applications, some of its problems are also highlighted. Especially when it is used for clustering high dimensional data, it fails to avoid the curse of dimensionality, sometimes even leading to invalid clustering results. In order to solve this problem, this paper adopts ensemble learning to improve H-K clustering and obtain better clustering results. Ensemble learning is an approach that trains a variety of learning classifiers to solve the same problem. In recent years, ensemble learning has been introduced into clustering analysis, where it is known as ensemble clustering. Ensemble clustering uses a fusion method to obtain a single ensemble clustering result from all the input clustering results of a given clustering result set.
To develop effective therapies for cancers, both clinical data and genomic data have been accumulated for cancer patients. Exploring clinical data or genomic data independently may not disclose the inherent patterns and correlations present in both datasets. Therefore, it is important to combine clinical and genomic data and mine knowledge from both data sources. Clustering is a powerful tool for revealing underlying patterns without requiring prior knowledge about the data. To discover phenotypes of cancer, subspace clustering has been widely used to analyze such data. For a cluster that is mutual to a clinical subspace and a genomic subspace, the genomic attributes can be used to verify and justify the clinical attributes. Such mutual clusters are more understandable and more robust. In addition, mutual subspace clustering is also helpful in combining multiple data sources.
2. Methodology
2.1 Finding signature subspace:
Finding the signature subspace for a particular cluster is a critical issue. Given a set of points C forming a cluster in a space S, the task is to find a subspace U ⊆ S that shows the similarity among the points in C.
Again, assume that the attributes are normalized. Ideally, if a suitable similarity measure sim_U(x, y) is used to compute the similarity between points x and y in a subspace U, then one could measure Σ_{x,y∈C} sim_U(x, y) for every non-empty subspace U ⊆ S and take the subspace maximizing the sum of similarities as the signature subspace.
However, such a method is often impractical for two reasons. First, defining an ideal similarity measure is very difficult. Many similarity or distance measures are biased towards low-dimensional subspaces, and similarities in different subspaces frequently cannot be compared directly. Second, if S has m dimensions, then 2^m − 1 subspaces need to be checked. When the dimensionality is very high, enumerating all subspaces and computing the sums of similarities in them is often very costly.
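To illustrate why exhaustive enumeration is infeasible, the following minimal Python sketch scores every non-empty subspace against a user-supplied similarity function; the function name sim, the representation of points as numeric tuples, and the helper name are assumptions made here for illustration, not part of the original method.

from itertools import combinations

def best_subspace_bruteforce(points, sim):
    """Score every non-empty subspace U of the m attributes and return the one
    maximizing the sum of pairwise similarities sum_{x,y in C} sim_U(x, y).
    This visits 2^m - 1 subspaces, which is why it does not scale with m."""
    m = len(points[0])
    best_u, best_score = None, float("-inf")
    for size in range(1, m + 1):
        for u in combinations(range(m), size):
            score = sum(sim(x, y, u)
                        for i, x in enumerate(points)
                        for y in points[i + 1:])
            if score > best_score:
                best_u, best_score = u, score
    return best_u

Even for m = 30 attributes this already means more than a billion candidate subspaces, which motivates the APD-based selection described next.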
If U is the signature subspace of C, which shows the similarity among the points in C, then the points must be similar in every attribute in U. In contrast, the points are largely dissimilar in attributes not in U. The average pairwise distance (APD) between the points in C on an attribute D can be used to measure the compactness of the cluster. That is,
APD(C, D) = [Σ_{x,y∈C} dist_D(x, y)] / (|C|·(|C|−1)/2) = 2·[Σ_{x,y∈C} dist_D(x, y)] / (|C|·(|C|−1))
The average pairwise distance can thus be used as a measure of how well the points in C are clustered on an attribute. The attributes in the signature subspace of C should have a small APD, while the attributes not in the signature subspace should have a large APD.
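As a minimal sketch of this measure, assuming normalized numeric attributes and taking the per-attribute distance dist_D(x, y) to be the absolute difference on that attribute (a choice made here for illustration; the paper does not fix a particular distance), the APD could be computed as follows:

from itertools import combinations

def apd(cluster, d):
    """Average pairwise distance of the points in `cluster` on attribute index d:
    APD(C, D) = 2 * sum_{x,y in C} dist_D(x, y) / (|C| * (|C| - 1))."""
    total = sum(abs(x[d] - y[d]) for x, y in combinations(cluster, 2))
    return 2 * total / (len(cluster) * (len(cluster) - 1))

For example, apd([(0.1, 0.9), (0.2, 0.1), (0.3, 0.5)], 0) averages the three pairwise gaps on the first attribute and returns about 0.13.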
All attributes should then be sorted in ascending order of their average pairwise distance. The first attribute is the best at showing the similarity. The problem now is how to select the other attributes that, together with the first one, show the similarity of the cluster. The attributes fall into two sets: those in the signature subspace and those not in it. The attributes in the signature subspace should have similar APD values. Chebyshev's inequality [2] can be applied to select the attributes in the signature subspace with confidence.
Let D_1, …, D_n be the attributes in ascending order of APD, and suppose D_1, …, D_i (1 ≤ i ≤ n) form the signature subspace. The expectation of the APD is E(APD) = (1/i) Σ_{j=1}^{i} APD(C, D_j). Let σ be the standard deviation of APD(C, D_1), …, APD(C, D_i). Then, for any attribute D_j (1 ≤ j ≤ i), we require |APD(C, D_j) − E(APD)| ≤ t·σ, where t is a small integer. By Chebyshev's inequality, 1 − 1/t² is the confidence level of the selection; for example, t = 2 gives a confidence level of 0.75.
Algorithmically, the signature subspace U is initialized with D_1, the first attribute in the sorted list. The remaining attributes are then added one by one in ascending order of average pairwise distance. For each attribute added, we check whether the confidence level is maintained. As soon as the confidence level is violated, the attribute just added is removed and the selection procedure terminates.
Algorithm (signature subspace selection):
Input: a set of objects O = {o1, o2, …, on}, a cluster C ⊆ O, and a threshold t.
1. Calculate the average pairwise distance APD(C, D) for each attribute D of C.
2. Sort the attributes in ascending order of APD. Let D1, D2, …, Dn be the sorted list.
3. Initialize the signature subspace U = {D1}.
4. For i = 2 to n DO
       Compute E(APD) and the standard deviation σ over the APD values of the attributes in U ∪ {Di}
       If |APD(C, Di) − E(APD)| ≤ t·σ THEN U = U ∪ {Di}
       Else terminate the loop
   End for
5. Return U
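A minimal Python sketch of this selection procedure, reusing the apd helper sketched above and Python's statistics module for the standard deviation (the helper names, the default t = 2, and the use of the population standard deviation are assumptions for illustration):

import statistics

def signature_subspace(cluster, num_attrs, t=2):
    """APD-based signature subspace selection: sort attributes by APD, seed U with
    the best attribute, then keep adding attributes while the newest attribute's APD
    stays within t standard deviations of the mean APD over the candidate subspace;
    stop at the first violation."""
    order = sorted(range(num_attrs), key=lambda d: apd(cluster, d))
    u = [order[0]]
    for d in order[1:]:
        values = [apd(cluster, a) for a in u + [d]]
        mean = sum(values) / len(values)
        sigma = statistics.pstdev(values)
        if abs(apd(cluster, d) - mean) <= t * sigma:
            u.append(d)
        else:
            break
    return u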
2.2 Algorithm for generation of clusters:
In this paper we use the k-means algorithm for full-space clustering and PROCLUS [3] for subspace clustering, adopting an iterative greedy search. The central idea of our top-down mutual subspace clustering is to interleave the iterative k-means clustering procedures in the clustering spaces. The process starts with k arbitrary points c1, …, ck in the clustering space S1 as the temporary centers of clusters C1, …, Ck respectively. The k centers do not necessarily belong to O. The points in O are assigned to the clusters according to their distances to the centers in space S1: a point o ∈ O is assigned to the cluster of the center closest to o. This is the first step of the k-means clustering procedure.
To find mutual subspace clusters, the information in the clustering space S2 is then used to refine the clusters. For this, the signature subspaces in S2 need to be found and the cluster assignment improved. For each cluster Ci, a subspace Vi ⊆ S2 is found as the signature subspace of Ci in S2, and the center of Ci in Vi is calculated. To improve the cluster assignment, for each point o ∈ O the distances dist_Vi(o, ci) for 1 ≤ i ≤ k are checked, and o is assigned to the cluster of the closest center in its signature subspace. This forms the refined clustering.
Symmetrically, the clustering refined using S2 is fed back into a refinement that uses the information in S1. That is, the signature subspaces and the centers of the clusters in S1 are computed, and the cluster assignment is adjusted again. As the iterations progress, the information in both clustering spaces is used to form the mutual subspace clusters.
A cluster can become stable if the signature subspaces in the two clustering spaces agree with each other on the cluster assignment; that is, in both clustering spaces the centers attract approximately the same set of points to the cluster. For some clusters, however, such as clusters of temperature readings, the signature subspaces do not agree with each other, because the centers in those spaces change continuously as the cluster membership changes with the continuously changing temperature.
When should the iteration terminate? If mutual subspace clusters exist, then the above iterative refinement greedily approaches them, since the refinement in S1 and S2 iteratively reduces the variance of the clusters in both spaces. Exact convergence, however, may require a large number of iterations. In practice, the mis-assignment rate is defined as the proportion of points in O that are assigned to a different cluster in a single iteration. The clustering is considered stable if the signature subspaces of the clusters become stable and the mis-assignment rate is low.
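As a small illustrative helper (the function name and label representation are assumptions, not notation from the paper), the mis-assignment rate between two consecutive assignments could be computed as:

def misassignment_rate(prev_labels, curr_labels):
    """Fraction of points whose cluster label changed between two consecutive iterations."""
    changed = sum(1 for a, b in zip(prev_labels, curr_labels) if a != b)
    return changed / len(curr_labels)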
The iterative refinement stops when the signature subspaces in the clustering spaces no longer change; in each round, the signature subspaces in both clustering spaces S1 and S2 are refined. Some points, however, may not belong to any mutual cluster. If the two clustering spaces do not agree with each other on those points, the iterative refinement may fall into an infinite loop. To detect such a loop, the cluster assignments in the two clustering spaces are compared over two consecutive rounds of refinement. For each mis-assigned point, i.e. a point assigned to different clusters in the two clustering spaces, if it is repeatedly assigned to the same cluster within each clustering space and the cluster centers are stable, then the point does not belong to a mutual cluster and should be removed. Such a point is called a conflict point. After removing the conflict points, the centers and the cluster assignment become stable, and the mutual subspace clusters can be derived.
Algorithm (mutual subspace clustering):
Input: a set of points O in clustering spaces S1 and S2, and the number of clusters k specified by the user.
Select k centers c1, …, ck randomly in S1 and assign each point to its closest center.
REPEAT
    For each cluster Ci DO
        Find the signature subspace and center of Ci in S2
    End DO
    Assign each point in O to the cluster of its closest center in that cluster's signature subspace of S2;
    For each cluster Ci DO
        Find the signature subspace and center of Ci in S1
    End DO
    Assign each point in O to the cluster of its closest center in that cluster's signature subspace of S1;
    If the clustering is stable, then remove conflict points
UNTIL the clustering is stable
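To make the interleaved refinement concrete, here is a minimal Python sketch of the loop. It reuses the apd and signature_subspace helpers sketched above, assumes the two point lists are aligned (points_s1[j] and points_s2[j] describe the same object), uses Euclidean distance restricted to a subspace, and omits conflict-point removal and the mis-assignment-rate test for brevity; it is an illustration under these assumptions, not the authors' implementation.

import random

def dist_subspace(x, y, u):
    """Euclidean distance between points x and y restricted to the attribute subset u."""
    return sum((x[d] - y[d]) ** 2 for d in u) ** 0.5

def mutual_subspace_clustering(points_s1, points_s2, k, max_iter=50, t=2):
    """Interleaved refinement: alternately recompute signature subspaces and centers
    in S2 and in S1, reassigning every point to the closest center measured in the
    corresponding signature subspace, until the assignment stops changing."""
    n = len(points_s1)
    m1 = len(points_s1[0])
    centers = random.sample(points_s1, k)                  # initial centers chosen in S1
    labels = [min(range(k), key=lambda i: dist_subspace(o, centers[i], range(m1)))
              for o in points_s1]
    for _ in range(max_iter):
        prev = labels[:]
        for space in (points_s2, points_s1):               # refine in S2, then in S1
            m = len(space[0])
            subspaces, centers = [], []
            for i in range(k):
                members = [space[j] for j in range(n) if labels[j] == i]
                if len(members) < 2:                        # degenerate cluster: fall back to full space
                    subspaces.append(list(range(m)))
                    centers.append(members[0] if members else space[0])
                    continue
                subspaces.append(signature_subspace(members, m, t))
                centers.append(tuple(sum(p[d] for p in members) / len(members)
                                     for d in range(m)))
            labels = [min(range(k),
                          key=lambda i: dist_subspace(o, centers[i], subspaces[i]))
                      for o in space]
        if labels == prev:                                  # stable: no point was reassigned
            break
    return labels

A call such as mutual_subspace_clustering(clinical_points, genomic_points, k=3) would return one cluster label per object after alternating refinement in the two views; clinical_points and genomic_points are hypothetical variable names for the two aligned attribute spaces.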
3. Experiments and Results
Generally, the proposed method consists of the following modules.
Data selection: The user must select a valid data set containing numeric data.
Divide into subspaces: The given input attributes are divided into two clustering spaces.
Basic clustering: Based on the user-specified number of clusters, that many centers are selected randomly and an initial clustering is performed.
Compute the signature subspaces: After the generation of clusters, compute the signature subspace of each cluster using the average pairwise distance (APD).
Compute new clusters: From the above signature subspaces, find the new centroids and form the clusters based on the generated centroids.
The last two steps are repeated until stable clusters are formed or the user-specified number of iterations is reached.
4. Conclusions and Future work
This project work deals with efficient subspace clustering, which finds sets of objects that are homogeneous in subspaces of high-dimensional datasets, and with the signature subspace, a combination of attributes from each data source that identifies the clusters in the most prominent way. The APD (average pairwise distance) between points in a cluster is used to measure the compactness of the cluster. This approach gives more optimal and consistent clusters than traditional approaches.
The current work can be improved by resolving the zero cluster size problem: if a centroid is not the closest centroid to any data item in the dataset, the corresponding cluster would be empty, because a newly computed centroid may not be present in the dataset.
5. References
[1] T.-S. Chen, T.-H. Tsai, Y.-T. Chen, C.-C. Lin and R.-C. Chen, "A combined k-means and hierarchical clustering method", Proceedings of the 2005 International Symposium on Intelligent Signal Processing and Communication Systems, pp. 405-408, 2005.
[2] M. Abramowitz and I. A. Stegun (Eds.), Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing, New York: Dover, p. 11, 1972.
[3] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc and J. S. Park, "Fast algorithms for projected clustering", Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD'99), Philadelphia, PA, pp. 61-72, June 1999.
[4] M. Ester, R. Ge, B. J. Gao, Z. Hu and B. Ben-Moshe, "Joint cluster analysis of attribute data and relationship data: the connected k-center problem", Proceedings of SDM, 2006.
[5] R. Agrawal, J. Gehrke, D. Gunopulos and P. Raghavan, "Automatic subspace clustering of high dimensional data for data mining applications", Proceedings of the ACM SIGMOD Conference, pp. 94-105, 1998.
[6] Behera Gayathri and A. Mary Sowjanya, "Dimensionality Reduction Using CLIQUE and Genetic Algorithm", IJCST, Vol. 6, Issue 3, July-Sept 2015.