Indian Journal of Science and Technology, Vol 9(38), DOI: 10.17485/ijst/2016/v9i38/101792, October 2016
ISSN (Print) : 0974-6846
ISSN (Online) : 0974-5645
H-D and Subspace Clustering of Paradoxical High
Dimensional Clinical Datasets with Dimension
Reduction Techniques – A Model
S. Rajeswari1*, M. S. Josephine2 and V. Jeyabalaraja3
1 Bharathiyar University, Coimbatore - 641046, India; [email protected]
2 Dr. M.G.R. Educational and Research Institute, Chennai - 600095, India; [email protected]
3 Velammal Engineering College, Chennai - 600066, India; [email protected]
*Author for correspondence
Abstract
Objectives: Heterogeneous high dimensional data clustering is the analysis of data with multiple dimensions. Large numbers of dimensions are not easy to handle, and complexity increases exponentially with dimensionality. Dimensionality reduction is the conversion of high dimensional data into a meaningful representation of reduced dimensionality that corresponds to the essential dimensionality of the data. To solve this problem, we put forward a general framework for clustering high dimensional datasets. Methods: Clustering is the method of finding groups of objects such that the objects in a group are similar to one another and different from the objects in other groups. In our framework, heterogeneous high dimensional clustering is partitioned into several one- or two-dimensional clustering phases. Findings: In this paper, a model is designed in which Hierarchical-Divisive clustering and subspace clustering are used to make non-overlapping clusters, combined with dimension reduction techniques to reduce the dimensions of paradoxical high dimensional clinical datasets. Applications: A solution for processing heterogeneous high dimensional datasets with techniques such as PCA, LDA and PSO.
Keywords: High Dimensional Data, Hierarchical-Divisive (H-D) Clustering, Subspace Clustering
1. Introduction
Data mining refers to the discovery of new information, in the form of patterns or rules, from large collections of data: it is a process that takes data as input and outputs knowledge. Clustering is a process by which the data are divided into groups, called clusters, such that objects in one cluster are closely related and objects in different clusters are highly dissimilar to each other1,2. Figure 1 shows such data clusters. In other words, clusters should have low inter-cluster similarity and high intra-cluster similarity. Applying standard clustering algorithms to high dimensional datasets has frequently presented a great challenge for traditional data mining techniques, both in efficiency and for practical purposes. As distances between data points become less distinct and the data become sparse, complexity increases, causing the "dimensionality disaster" problem that makes clustering difficult3. The proposed model should therefore maintain the quality of the data and the speed of processing, making it more effective than existing algorithms, whose computation of clusters in high dimensional data is highly complex and whose cluster accuracy is poor. Research in the area of clustering has thus introduced new concepts such as subspace clustering, ensemble clustering and the H-K clustering process4,5. Applying these concepts to heterogeneous high dimensional datasets leads to a dimensional adversity problem on which this work concentrates. Subspace clustering, an extension of the traditional clustering model, finds clusters in various datasets6. It deals with the detection of groups of clusters that are scattered within different subspaces of the same dataset; the problem becomes how to find such subspace clusters effectively and efficiently.
Ensemble clustering, 'the knowledge reuse framework', was proposed in7. Traditional clustering algorithms give less efficient results when dealing with high dimensional data because of problems such as the "curse of dimensionality". Issues such as irrelevant noisy features and the sparsity of data should be mitigated as far as possible. The highest priority is given to these problems, so as to provide an advanced clustering algorithm that clusters the data efficiently. We propose a model combining advanced clustering algorithms to improve cluster quality and the speed of processing large amounts of data. The proposed model combines three techniques: Hierarchical (Divisive) clustering and subspace clustering (PROCLUS), in combination with a dimension reduction technique such as PCA, SVD, LDA or PSO, which will improve cluster efficiency and reduce the curse of dimensionality.
Figure 1. Data clusters.

1.1 Paradoxical High Dimensional Clinical Datasets
A heterogeneous high dimensional dataset is a set of interrelated components that are autonomous in nature. The attributes present in one component may be completely different from the attributes in other component datasets, which complicates integrating their semantics into the overall heterogeneous database. Different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases and file systems, are combined to form the heterogeneous databases referred to as legacy databases8. Here we represent these data as paradoxical high dimensional clinical datasets. One of the most significant challenges of data mining on the medical side is obtaining quality, relevant clinical trial data. Medical data are complex and heterogeneous in nature because they are collected from various sources, such as laboratory reports, discussions with the patient, or reviews by physicians. Medical information is characterized by redundancy, multi-attribution and incompleteness, and is closely related to time.

1.2 Hierarchical Clustering Analysis
Hierarchical clustering and partition clustering are the basic types of clustering algorithms. Hierarchical clustering builds a hierarchy of clusters using single-link and complete-link clustering features. It is further classified into agglomerative (bottom-up) and divisive (top-down) approaches.

Agglomerative Clustering
This hierarchical process begins with each object or observation in a separate cluster. In each subsequent step, the most similar clusters are combined to form a new cumulative cluster. The iterative process is repeated until all objects are finally combined into a single cluster, going from n clusters to 1. As the similarity measure decreases during successive steps, clusters cannot be split again. AGNES starts with single data points and recursively merges two or more clusters.
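The AGNES-style bottom-up merging described above can be sketched as follows; this is an illustrative example on synthetic data, assuming scikit-learn is available (the paper does not specify an implementation):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated synthetic groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (5, 2)),
               rng.normal(5.0, 0.1, (5, 2))])

# Bottom-up merging with complete linkage, stopped once 2 clusters remain.
labels = AgglomerativeClustering(n_clusters=2, linkage="complete").fit_predict(X)
print(labels)
```

With well-separated groups, the two original groups are recovered as the two final clusters.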
Divisive Clustering
Divisive clustering starts with all observations in a single cluster, which is then divided step by step. From the single cluster, the most dissimilar objects are segregated into additional clusters: the one cluster is divided into two clusters, then one of these clusters is split for a total of three clusters, and the iteration continues until every observation is its own cluster, going from 1 cluster to n clusters. DIANA is the hierarchical divisive clustering algorithm that starts with one big cluster and divides it into successively smaller clusters. For comparing the clusters of a heterogeneous high dimensional dataset, hierarchical cluster analysis provides a strong framework with accurate solutions, and the HCA method helps us evaluate how many clusters should be considered.
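The top-down splitting described above might be sketched as a simplified bisecting scheme. Note that this uses 2-means for each split as a stand-in for DIANA's dissimilarity-based splitting, so it is an illustrative assumption, not the paper's algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_split(X, n_clusters):
    """Top-down splitting: start with one cluster, repeatedly bisect
    the largest cluster until n_clusters remain (a simplified
    DIANA-like scheme using 2-means for each split)."""
    clusters = [np.arange(len(X))]  # one big cluster of point indices
    while len(clusters) < n_clusters:
        big = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        idx = clusters.pop(big)
        halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[halves == 0])
        clusters.append(idx[halves == 1])
    return clusters

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.2, (10, 2)) for c in (0.0, 4.0, 8.0)])
parts = divisive_split(X, 3)
print([len(p) for p in parts])
```

The result is a partition of the indices into three disjoint clusters, going from 1 cluster toward n as the text describes.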
The advantages of Hierarchical Clustering Analysis (HCA) are:
Simplicity: With the help of the dendrogram structure, hierarchical cluster analysis provides a simple, wide-ranging depiction of clustering solutions.
Measure of Similarity: HCA can be applied to almost any type of research question.
Speed: HCA has the advantage of generating an entire set of clustering solutions in a convenient manner8.
1.4 Subspace Clustering
Subspace clustering is an extended method of attribute subset selection that has shown its strength in high dimensional clustering. It is based on the observation that different subspaces may contain different, meaningful clusters: subspace clustering explores groups of clusters within different subspaces of the same data set, and the problem becomes how to find such subspace clusters effectively and efficiently. Representative methods are dimension-growth subspace clustering (CLIQUE), dimension-reduction projected clustering (PROCLUS) and frequent pattern based clustering (pCluster). CLIQUE (Clustering In QUEst) splits the n-dimensional data space into non-overlapping rectangular units and identifies the dense units among them; this is done for each dimension, and CLIQUE then finds, in an automated manner, the subspaces of high dimensionality that contain high-density clusters. PROCLUS (Projected Clustering) is a dimension-reduction subspace clustering method. Starting from single-dimensional spaces, PROCLUS finds an initial evaluation of the clusters in the single-dimensional attribute space; the dimensions present in each cluster are then assigned specific weight values9, and these weights are passed to the next iteration to regenerate the clusters. It explores the dense regions in all subspaces of the required dimensionality while avoiding the generation of a huge quantity of overlapped clusters in projected dimensions of lower dimensionality. Compared to CLIQUE, PROCLUS finds non-overlapped partitions of points. The discovered clusters may help to better understand the high-dimensional data and facilitate subsequent analyses. Frequent pattern-based cluster analysis can discover significant associations and correlations among data objects in the clusters: rather than growing the clusters dimension by dimension, it grows sets of frequent item sets, which eventually lead to cluster descriptions. An advantage of frequent term-based clustering is the automatically generated cluster description from the frequent item sets; traditional clustering methods produce only clusters, and several additional processing steps are needed to generate cluster descriptions9.
Recently, a body of work has been done in the area of high dimensional data, which is explained briefly in10,11.
Dimensionality Reduction
Feature extraction and feature transformation are the most popular techniques of dimension reduction. Experimental evaluations indicate that, with both methods, the accuracy and effectiveness of the data can be affected by the lost information, and that feature selection algorithms have difficulty when clusters lie in different subspaces. This type of data motivated the evolution of subspace clustering algorithms.
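As a hedged illustration of feature transformation, the following applies PCA (one of the reduction techniques named in this paper) to synthetic data whose variance is concentrated in two directions; scikit-learn is assumed:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# 100 samples in 10 dimensions, where most variance lives in 2 directions.
base = rng.normal(size=(100, 2))
X = base @ rng.normal(size=(2, 10)) + rng.normal(scale=0.01, size=(100, 10))

pca = PCA(n_components=2)
X_low = pca.fit_transform(X)  # project onto the top 2 principal components
print(X_low.shape)
```

Because the signal is intrinsically two-dimensional, the two retained components capture almost all of the variance, illustrating how reduced dimensionality can correspond to the essential dimensionality of the data.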
2. Proposed Model
The complete flow diagram of the proposed model is shown in Figure 2, Model of Dimension Reduction. Based on this flowchart, the following sections unfold its stages in detail:
Figure 2. Model of Dimension Reduction.
Phase 1: Dataset Pre-Processing
Import the dataset for pre-processing; the clinical dataset has many missing values and outliers. Pre-processing is needed to remove these kinds of noise and turn the raw data into processed data.
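Phase 1 might be realized along the following lines; the column names, the median fill strategy and the 1.5*IQR clipping rule are hypothetical choices, since the paper does not fix them:

```python
import numpy as np
import pandas as pd

# Hypothetical clinical columns with missing values and one entry error.
df = pd.DataFrame({
    "glucose": [95.0, 102.0, np.nan, 110.0, 980.0],   # 980 is an outlier
    "pressure": [80.0, np.nan, 75.0, 90.0, 85.0],
})

# Fill missing values with each column's median.
df = df.fillna(df.median())

# Clip outliers to the 1.5*IQR whiskers, column by column.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
df = df.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr, axis=1)
print(df)
```

After this step the frame contains no missing values and the extreme glucose reading has been pulled back to the upper whisker.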
Phase 2: H-D Clustering Process
With the divisive (top-down) approach, the dataset is divided into n clusters from the top. Given a number of clusters and a threshold value, the clusters are formed and represented by a dendrogram structure. When clustering heterogeneous high-dimensional clinical datasets, overlapping may occur, with clusters formed as subsets of one another; the conversion from high dimensional to low dimensional is therefore incomplete, as Figure 3, H-D clusters of data, shows.
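The "number of clusters and threshold value" step of Phase 2 can be sketched by cutting a dendrogram at a distance threshold. SciPy builds the tree bottom-up, so this is an illustrative stand-in for the divisive procedure, not the paper's implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (5, 3)),
               rng.normal(3, 0.1, (5, 3))])

Z = linkage(X, method="ward")                      # dendrogram structure
labels = fcluster(Z, t=1.0, criterion="distance")  # cut at threshold 1.0
print(labels)
```

Lowering the threshold yields more, smaller clusters; raising it merges them, which is how the threshold value controls the clusters formed from the dendrogram.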
Figure 3. H-D clusters of data.

Phase 3: The Subspace Clustering Process
At the end of divisive clustering, the overlapped clusters are refined by the subspace clustering process. These overlapping clusters contain the required datasets. By assigning the number of clusters and determining the subspaces, the process shows the number of clusters present within them. Finally, the number of clusters is reduced by combining the groups that are close and similar to each other, as shown in Figure 4, the cluster refining process.

Figure 4. Cluster refining process.

Phase 4: Dimension Reduction Techniques
The subspace process yields the reduced clusters, but these reduced clusters still have several attributes or dimensions. Combined with subspace clustering, Principal Component Analysis, Linear Discriminant Analysis, Singular Value Decomposition, Factor Analysis and similar techniques can be used to reduce the multi-attribute datasets. According to our domain knowledge, the paradoxical clinical datasets are heterogeneous high dimensional in nature: for example, the blood report and the scan report of the same patient show different numbers of attributes. These attributes are clustered by the H-D clustering algorithm; after H-D clustering, some overlapping clusters are formed. Using the subspace clustering algorithm, these overlapping clusters are reduced to form the prominent clusters, and combined with dimension reduction techniques the result is the required reduced datasets. By applying the phases above, the proposed model obtains a reduced number of clusters, and finally an accurate and efficient reduced set of clinical data that will be very useful for diagnosing a patient's problem.
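One possible reading of Phase 4, applying a reduction technique such as PCA inside each refined cluster, is sketched below; the per-cluster strategy and all names are illustrative assumptions, as the paper does not specify how the techniques are combined:

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_clusters(X, labels, n_components=2):
    """Apply PCA separately inside each refined cluster, so every
    cluster keeps only its own dominant directions of variation."""
    reduced = {}
    for c in np.unique(labels):
        members = X[labels == c]
        k = min(n_components, members.shape[0], members.shape[1])
        reduced[c] = PCA(n_components=k).fit_transform(members)
    return reduced

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 6))
labels = np.repeat([0, 1], 20)   # stand-in for Phase 3's refined clusters
out = reduce_clusters(X, labels)
print({c: v.shape for c, v in out.items()})
```

Each cluster's six attributes are reduced to two components, giving the lower dimensional clinical datasets the model aims for.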
3. Conclusion and Future Enhancement
Heterogeneous high dimensional dataset processing faces complications such as "the curse of dimensionality" and the sparsity of data in the high dimensional space. The proposed model provides a solution for processing heterogeneous high dimensional datasets that is a composition of hierarchical (divisive) clustering, subspace clustering (PROCLUS) and a dimension reduction algorithm such as PCA, LDA or PSO. The hierarchical clusters of the corresponding dataset are passed to subspace clustering, generating subsets of non-overlapping clusters, which yields low dimensional clusters; combined with dimension reduction techniques, the final stage converts high dimensional, multi-attribute datasets into lower dimensional clinical datasets. This paper provides a model for dimension reduction in paradoxical high dimensional clinical datasets. Future work will involve generating the algorithm for the combined concepts above, implementing it on benchmark clinical datasets, providing efficient results and visualizing those results.
4. References
1. Joshi A, Kaur R. A review: Comparative study of various clustering techniques in data mining. International Journal of Advanced Research in Computer Science and Software Engineering. 2013 Mar; 3(3):55-7.
2. Smyth P. Clustering using Monte Carlo cross-validation. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD). 1996; p. 126-33.
3. Paithankar R, Tidke B. A H-K clustering algorithm for high dimensional data using ensemble learning. International Journal of Information Technology Convergence and Services. 2014 Dec; 4(5/6):1-9.
4. Muller E. Evaluating clustering in subspace projections of high dimensional data. Proceedings of the VLDB Endowment. 2009 Aug; 2(1):1270-81.
5. A novel approach for high dimensional data clustering. Date accessed: 9/01/2010. Available from: http://ieeexplore.ieee.org/document/5432636/.
6. Parsons L, Haque E, Liu H. Subspace clustering for high dimensional data: A review. ACM SIGKDD Explorations Newsletter. 2004 Jun; 6(1):90-105.
7. Strehl A, Ghosh J. Cluster ensembles - A knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research. 2003 Jan; 3:583-617.
8. He Y, Wang J, Qin LX, Lin M. A H-K clustering algorithm for high dimensional data using ensemble learning. IET International Conference on Smart and Sustainable City 2013 (ICSSC 2013). 2013 Aug; p. 300-5.
9. Han J, Kamber M. Data Mining: Concepts and Techniques, 3rd edn. Morgan Kaufmann Publishers; 2011 Jul.
10. Sim K, Gopalkrishnan V, Zimek A, Cong G. A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery. 2013 Mar; 26(2):332-97.
11. Moise G, Zimek A, Kröger P, Kriegel HP, Sander J. Subspace and projected clustering: Experimental evaluation and analysis. Knowledge and Information Systems. 2009 Dec; 21:299-326.