Fast and Scalable Subspace Clustering
of High Dimensional Data
by
Amardeep Kaur
A thesis presented for the degree of
Doctor of Philosophy
School of Computer Science and Software Engineering
The University of Western Australia
Crawley, WA 6009, Australia
2016
Dedicated to my late mother
Abstract
Due to the availability of sophisticated data acquisition technologies, increasingly detailed
data is being captured through diverse sources. Such detailed data leads to a high number
of dimensions. A dimension represents a feature or an attribute of a data point. There is
an emergent need to find groups of similar data points called ‘clusters’ hidden in these
high-dimensional datasets. Most clustering algorithms that perform very well in low dimensions fail when the number of dimensions is high. In this thesis, we focus on designing efficient solutions for the clustering problem in high-dimensional data. In addition to finding similarity groups in high-dimensional data, there is an increasing interest in finding dissimilar data points as well. Such dissimilar data points are called outliers. Outliers play an important role in the data cleaning process, which is used to improve data quality. We aid the expensive data cleaning process by providing additional knowledge about the outliers in high-dimensional data. We find outliers in their relevant sets of dimensions and rank them by the strength of their outlying behaviour.
We first study the properties of high-dimensional data and identify the reasons for
inefficiencies in the current clustering algorithms. We find that different combinations
of data dimensions reveal different possible groupings among the data. These possible
combinations or subsets of dimensions of the data are called subspaces. Each data point
represents measurements of a phenomenon over many dimensions. A dataset can be better understood by clustering it in its relevant subspaces and this process is called subspace
clustering. There is a growing demand for efficient and scalable subspace clustering solutions in many application domains like biology, computer vision, astronomy and social
networking. But the exponential growth in the number of subspaces with the data dimensions makes the whole process of subspace clustering computationally very expensive.
Some clustering algorithms look for a fixed number of clusters in pre-defined subspaces. Such algorithms defeat the purpose of discovering previously unknown and hidden clusters: we cannot have prior information about the relevant subspaces or the
number of clusters. The iterative process of combining lower-dimensional clusters into
higher-dimensional clusters in a bottom-up fashion is a promising subspace clustering approach. However, the performance of existing subspace clustering algorithms based on
this approach deteriorates with the increase in data dimensionality. Most of these algorithms require multiple database scans to generate an index structure for enumerating the
data points in multiple subspaces. Also, a large number of redundant subspace clusters
are generated, either implicitly or explicitly, during the clustering process.
We present SUBSCALE, a novel and efficient clustering algorithm to find all hidden subspace clusters in high-dimensional data with minimal cost and optimal quality.
Unlike other bottom-up subspace clustering algorithms, neither does our algorithm rely on
the step-by-step iterative process of joining lower-dimensional candidate clusters nor does
it selectively choose any user-defined subspace. Our algorithm directly steers toward the
higher dimensional clusters from one-dimensional clusters without the expensive process
of joining each and every intermediate cluster. Our algorithm is based on a novel idea
from number theory and effectively avoids the cumbersome enumeration of data points in
multiple subspaces. Moreover, the SUBSCALE algorithm requires only k database scans
for a k-dimensional dataset. Other salient features of the SUBSCALE algorithm are that
it does not generate any redundant clusters and is much more scalable as well as faster
than the existing state-of-the-art algorithms. Several relevant experiments were conducted
to compare the performance of our algorithm with the state-of-the-art algorithms and the
results are promising.
Although the SUBSCALE algorithm scales very well with the dimensionality of the
data, the only computational hurdle is the generation of one-dimensional candidate clusters. All of these one-dimensional clusters are required to be kept in the computer's
working memory to be combined effectively. Because of this, random access memory
requirements are expected to grow substantially for the bigger datasets. Nonetheless, an
important property of the SUBSCALE algorithm is that the process of computing each
subspace cluster is independent of the others. This property helped us to improve the
SUBSCALE algorithm so that it can process the data to find subspace clusters even with
a limited working memory. The clustering computations can be split into any granularity
level so that one or more computation chunks can fit into the available working memory.
The scalable SUBSCALE algorithm can also be distributed across multiple computer systems with smaller processing capabilities for faster results. The scalability performance
was studied with up to 6144 dimensions, whereas recent subspace clustering algorithms break down at a few tens of dimensions.
To speed up the clustering process for high-dimensional data, we also propose a parallel version of the subspace clustering algorithm. The parallel SUBSCALE algorithm is
based on shared-memory architecture and exploits the computational independence in the
structure of the SUBSCALE algorithm. We aim to leverage the computational power of
widely available multi-core processors and improve the runtime performance of the SUBSCALE algorithm. We parallelized the SUBSCALE algorithm and first experimented
with processing the candidate clusters from single dimensions in parallel. However, this implementation required mutually exclusive access to certain portions of the working memory, which created a performance bottleneck in the parallel algorithm. We modified the algorithm further to overcome this hindrance
and sliced the computations in a way that at any given time no two threads will try to
access the same block of memory. The experimental evaluation with up to 48 cores has
shown linear speed-up.
Although the largely automatic collection of data has opened new frontiers for analysts to gain knowledge insights, it has also introduced many sources of error into the data. Hence,
the data quality problem is becoming increasingly exigent. The reliability of any data
analysis depends upon the quality of the underlying data. It is well known that data
cleaning is a laborious and expensive process. Data cleaning involves detecting and removing abnormal values called outliers. Outlier identification becomes harder as
the data dimensionality increases. Similar to the clusters, outliers show their anomalous
behaviours in the locally relevant subspaces of the data and because of the exponential
search space of high-dimensional data, it is extremely challenging to detect outliers in all
possible subspaces. Moreover, a data point existing as an outlier in one subspace can exist
as a normal data point in another subspace. Therefore, it is important that when identifying an outlier, a characterisation of its outlierness is also given. These additional details
can aid a data analyst to make important decisions about whether an outlier should be
removed, fixed or left unchanged. We propose an effective outlier detection algorithm for
high-dimensional data as an extension of the SUBSCALE algorithm. We also provide an
effective methodology to rank outliers by strength of their outlying behaviour. Our outlier
detection and ranking algorithm does not make any assumptions about the underlying data
distribution and can adapt to different density parameter settings. We experimented with
different datasets, and the top-ranked outliers were predicted with more than 82% precision and recall. A lower or tighter density threshold reveals more data points as outliers, while a higher or looser density threshold allows more data points to be part of one or more
clusters, and therefore, lowers the overall ranking. With our outlier detection and ranking
algorithm, we aim to provide data analysts with a better characterisation of each outlier.
In this thesis, we endeavour to further data mining research for high-dimensional datasets by proposing efficient and effective techniques to detect and handle similar and dissimilar data patterns.
Acknowledgements
The PhD journey has been a learning experience for me, both on the personal and the professional front. I would like to thank some of the many people who have helped me in various ways
to complete this thesis.
First and foremost, I would like to offer sincere gratitude to my principal supervisor
Professor Amitava Datta for his patience, encouragement and overall support. My writing
and research skills have considerably improved compared to where I stood at the start of
this PhD, mainly because of his positive and non-judgemental criticism along with continuous guidance. Thank you for sharing your wealth of knowledge and giving me this great
opportunity to learn. I am also grateful to my co-supervisor Associate Professor Chris
McDonald for his help in proof reading and providing useful feedback. While assisting
him in the university teaching activities, I learnt a lot by observing the thoughtfulness and
sheer hard work he put in for his students.
I acknowledge the financial and overall support received from the Australian Government through the Endeavour Postgraduate Award. Their professional workshops and regular
contacts by the case managers have been invaluable. The supercomputing training by
Pawsey Supercomputing Centre was of immense help. I thank IBM SoftLayer for providing their server for research. I would also like to thank the anonymous reviewers whose
comments and feedback helped me improve my publications and subsequent thesis-work.
I offer my gratitude to the peaceful and serene university campus situated on the spiritual Noongar land. The Graduate Research School had many informative workshops and
seminars that supported me throughout my research journey. I am thankful for the technical and
administrative support available through my School of Computer Science and Software
Engineering. My heartfelt thanks to Dr. Anita Fourie from student support services for
being a good listener and a life-affirming pillar during those periods plagued by a mix of uncertainties.
The discussions with my lab colleagues Nasrin, Alvaro, Kwan, Mubashar and Noha
have been both a learning and a memorable experience. Special thanks to Noha for her
care and concern all this time. I am grateful for the lovely bunch of friends especially
Arshinder, Lakshmi, Feng and Darcy for their love and support. Many thanks to Catherine, who was instrumental in the start of this journey. Also, to my lost friend Setu for
believing in me more than I believed in myself.
The biggest debt is to my adorable father, Jaswinder Singh Dua, whom I can never
repay for his unconditional love. I am thankful to him for letting me have my wings and
always standing by me, no matter what.
Lastly, my taste-buds cannot escape without thanking Connoisseur’s Cookies & Cream
ice-cream which was always there to fall back upon, whatever be the reason and the season.
Publications
1. Kaur, A. & Datta, A. A novel algorithm for fast and scalable subspace clustering of
high-dimensional data. In: Journal of Big Data, vol. 2, no. 17, pp. 1–24, 2015.
2. Kaur, A. & Datta, A. SUBSCALE: Fast and scalable subspace clustering for high
dimensional data. In: Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW), pp. 621–628, 2014.
Contribution to thesis
My contribution to the thesis was 85%. I developed and implemented the idea, designed
the experiments, analysed the results and wrote the manuscript. My supervisor, Professor
Amitava Datta, contributed to the underlying idea and played a pivotal role in guiding and supervising throughout, from the initial conception to the final submission of this manuscript.
Contents

1 Introduction
    1.1 Curse of dimensionality
    1.2 Subspace clustering problem
        1.2.1 Apriori principle
    1.3 Motivating examples
    1.4 Thesis organisation
2 Literature Review
    2.1 Introduction
    2.2 Partitioning algorithms
        2.2.1 K-means and variants
        2.2.2 Projected clustering
    2.3 Non-partitioning algorithms
        2.3.1 Full-dimensional based algorithms
        2.3.2 Subspace clustering
    2.4 Desirable properties of subspace clustering
3 A novel fast subspace clustering algorithm
    3.1 Introduction
        3.1.1 Exponential search space
        3.1.2 Redundant clusters
        3.1.3 Pruning and redundancy
        3.1.4 Multiple database scans and inter-cluster comparisons
    3.2 Research design and methodology
        3.2.1 Definitions and problem
        3.2.2 Basic idea
        3.2.3 Assigning signatures to dense units
        3.2.4 Interleaved dense units
        3.2.5 Generation of combinatorial subsets
        3.2.6 SUBSCALE algorithm
        3.2.7 Removing redundant computation of dense units
    3.3 Results and discussion
        3.3.1 Methods
        3.3.2 Execution time and quality
        3.3.3 Determining the input parameters
    3.4 Summary
4 Scalable subspace clustering
    4.1 Background
    4.2 Memory bottleneck
    4.3 Collisions and the hash table
        4.3.1 Splitting hash computations
    4.4 Scalable SUBSCALE algorithm
    4.5 Experiments and analysis
    4.6 Summary
5 Parallelization
    5.1 Introduction
    5.2 Related work
    5.3 Parallel subspace clustering
        5.3.1 SUBSCALE algorithm
        5.3.2 Parallelization using OpenMP
    5.4 Results and Analysis
        5.4.1 Experimental setup
        5.4.2 Data Sets
        5.4.3 Speedup with multiple cores
        5.4.4 Summary
6 Outlier Detection
    6.1 Introduction
    6.2 Outliers and data cleaning
    6.3 Current methods for outlier detection
        6.3.1 Full-dimensional based approaches
        6.3.2 Subspace based approaches
    6.4 Our approach
        6.4.1 Anti-monotonicity of the data proximity
        6.4.2 Minimal subspace of an outlier
        6.4.3 Maximal subspace shadow
    6.5 Experiments
    6.6 Summary
7 Conclusion and future research directions
List of Figures

1.1 Clusters
1.2 Data grouping
1.3 Bottom-up clustering
2.1 Data partitioning
2.2 Core and border data points in DBSCAN
3.1 Bottom-up clustering
3.2 Projections of dense points
3.3 Projections of clusters
3.4 Matching dense units across dimensions
3.5 Numerical experiments for probability of collisions
3.6 Experiments with Erdos Lemma
3.7 Collisions among signatures
3.8 An example of sorted data points in a single dimension
3.9 An example of overlapping between consecutive core-sets of dense data points
3.10 An example of using pivot to remove redundant computations of dense units from the core-sets
3.11 Effect of ε on runtime
3.12 ε vs F1 measure
3.13 Runtime comparison for similar quality of clusters
3.14 Runtime comparison for different quality of clusters
3.15 Runtime comparison between different subspace clustering algorithms for fixed data size
3.16 Runtime comparison between different subspace clustering algorithms for fixed dimensionality
3.17 Number of subspaces found vs runtime
4.1 Number of clusters vs size of the dataset
4.2 Data sparsity with increase in the number of dimensions
4.3 Internal structure of a signature node
4.4 Signature collisions in a hash table
4.5 Illustration of splitting hTable computations
4.6 Runtime vs split factor for madelon dataset
5.1 Projections of dense points
5.2 Structure of signature node
5.3 hTable data structure
5.4 Allocating separate thread to each dimension
5.5 Multiple threads for dimensions
5.6 Multiple threads for slices
5.7 Speedup
5.8 Bell curve of signatures generated in each slice
5.9 Distribution of values in keys
6.1 Outlier in trivial subspace
6.2 Outlier scores for shape dataset
6.3 Outlier scores for Parkinsons Disease dataset
6.4 Outlier scores for Breast Cancer (Diagnostic) dataset
6.5 Outlier scores for madelon dataset
List of Tables

1.1 Data matrix
3.1 Marks dataset
3.2 Clusters in the Marks dataset
3.3 List of datasets used for evaluation
4.1 Number of subspaces with increase in dimensions
6.1 Outlier removal dilemma
6.2 Evaluation of Parkinsons disease dataset
6.3 Evaluation of Breast Cancer dataset
Chapter 1
Introduction
With recent technological advancements, high-dimensional data are being captured in almost every conceivable area, ranging from astronomy to biological sciences. Thousands
of microarray data repositories have been created for gene expression investigation [1];
sophisticated cameras are becoming ubiquitous, generating a huge amount of visual data
for surveillance; the Square Kilometre Array Telescope is being built for astrophysics research and is expected to generate several petabytes of astronomical data every hour [2].
All of these datasets have hundreds or even thousands of dimensions, and the number of dimensions keeps increasing day by day with better data capturing technologies. The dimensions of a dataset are also known as its attributes or features. The dimensionally
rich data poses significant research challenges for the data mining community [3, 4].
Clustering is one of the important data mining tasks to explore and gain useful information from the data [5]. Very often, it is desirable to identify natural structures of similar
data points, for example, customers with similar purchasing behaviour, genes with similar expression profiles, stars or galaxies with similar properties. Clustering can also be
seen as an extension of the basic human tendency to identify and categorise the things around us.
Clustering is an unsupervised process to discover these hidden structures or groups called
clusters, based on similarity criteria and without any prior information of the underlying
data distribution.
Figure 1.1: Clusters.
Figure 1.1 is a pictorial representation of grouping two-dimensional points into clusters. We notice that some of these points do not participate in any of the clusters.
To illustrate the clustering process in brief, consider an n × k dataset DB of k dimensions such that each data point P_i is measured as a k-dimensional vector (P_i^1, P_i^2, ..., P_i^k), where P_i^d, 1 ≤ d ≤ k, is the value of the data point P_i in the d-th dimension. We assume the data lie in a metric space (Table 1.1). A cluster C is a set of points which are similar based on a similarity threshold. Thus, points P_i and P_j participate in the same cluster if sim(P_i, P_j) = true.
Table 1.1: Data matrix

            d_1          d_2          ...    d_{k-1}          d_k
P_1         P_1^1        P_1^2        ...    P_1^{k-1}        P_1^k
P_2         P_2^1        P_2^2        ...    P_2^{k-1}        P_2^k
...
P_{n-1}     P_{n-1}^1    P_{n-1}^2    ...    P_{n-1}^{k-1}    P_{n-1}^k
P_n         P_n^1        P_n^2        ...    P_n^{k-1}        P_n^k
Similarity measure
A variety of distance measures can be used to quantify the similarity of the data points
[6–8]. Distance is one of the commonly used measures of similarity in metric data. The
shorter the distance between two data points, the more similar they are. The L_p-norm calculates the distance between two k-dimensional points P_i and P_j by comparing the values of their k dimensions (also called features), cf. Equation 1.1.
distance(P_i, P_j) = L_p(P_i, P_j) = \sqrt[p]{\sum_{d=1}^{k} |P_i^d - P_j^d|^p}    (1.1)
L1 and L2 are two important forms of the Lp norm widely used in clustering, cf. Equations
1.2 and 1.3 respectively. L1 is also called City block distance or Manhattan distance and
L2 is called Euclidean distance.
L_1(P_i, P_j) = \sum_{d=1}^{k} |P_i^d - P_j^d|    (1.2)
L_2(P_i, P_j) = \sqrt{\sum_{d=1}^{k} (P_i^d - P_j^d)^2}    (1.3)
Most of the clustering algorithms generate clusters by measuring proximity between
the data points through Lp distance and using either all or a subset of dimensions [9, 10].
Two points Pi and Pj belong to the same cluster if Lp (Pi , Pj ) ≤ threshold. The proximity threshold is decided by the user along with the density criterion. The density parameter
tells how many points should lie within a close neighbourhood in a data space so that this
region can be called a cluster. However, as the number of dimensions increases, the distance/density measurements fail to detect meaningful clusters due to a phenomenon called
the Curse of dimensionality, which is discussed below.
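To make the distance-based similarity test concrete, the following short Python sketch (with hypothetical helper names, not code from this thesis) computes the L1 and L2 distances of Equations 1.2 and 1.3 and applies the threshold test described above.

```python
def lp_distance(p, q, p_norm=2):
    """L_p distance between two k-dimensional points given as sequences of numbers."""
    return sum(abs(a - b) ** p_norm for a, b in zip(p, q)) ** (1.0 / p_norm)

def similar(p, q, threshold, p_norm=2):
    """Threshold test: two points may share a cluster if their L_p distance is small enough."""
    return lp_distance(p, q, p_norm) <= threshold

# Manhattan (p=1) and Euclidean (p=2) distances between two 4-dimensional points.
P_i = (1.0, 2.0, 0.5, 3.0)
P_j = (1.5, 1.0, 0.5, 2.0)
print(lp_distance(P_i, P_j, p_norm=1))   # 2.5
print(lp_distance(P_i, P_j, p_norm=2))   # 1.5
print(similar(P_i, P_j, threshold=2.0))  # True for the Euclidean distance
```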
1.1 Curse of dimensionality
Clustering high-dimensional data is difficult due to the unique constraints imposed by a large number of dimensions, known as the Curse of dimensionality, a term coined by Richard Bellman [11]. The curse of dimensionality has two implications: the first concerns the similarity measure and the second concerns the presence of irrelevant attributes. According to Beyer et al. [12], as the dimensionality of the data grows, data points tend to become equally distant from each other, and thus the relative contrast between similar and dissimilar points diminishes.

Figure 1.2: Data group together differently under different subsets of dimensions.
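As a rough illustration of this distance-concentration effect (a sketch assuming uniformly distributed random data, not an experiment reported in this thesis), the following snippet estimates how the contrast between the farthest and nearest neighbours of a query point shrinks as the dimensionality grows.

```python
import random

def relative_contrast(num_points=500, dims=2, trials=5):
    """Average (max - min) / min distance from a random query point to uniformly
    random points; a rough proxy for the contrast between near and far neighbours."""
    total = 0.0
    for _ in range(trials):
        points = [[random.random() for _ in range(dims)] for _ in range(num_points)]
        query = [random.random() for _ in range(dims)]
        dists = [sum((a - b) ** 2 for a, b in zip(p, query)) ** 0.5 for p in points]
        total += (max(dists) - min(dists)) / min(dists)
    return total / trials

# The contrast typically drops sharply as the number of dimensions increases.
for k in (2, 10, 100, 1000):
    print(k, round(relative_contrast(dims=k), 3))
```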
The second implication is the presence of irrelevant dimensions in high-dimensional
datasets. Data tend to group together differently under different subsets of dimensions
(attributes), and not all dimensions are relevant at the same time.
For example, a 4-dimensional dataset can be projected onto a 2-dimensional space in
six different ways. Figure 1.2 shows possible relationships between two different points P1 and P2 under different subsets of dimensions. We notice that only one of the points P1 and
P2 participates in a cluster formation when projected on dimensions {d1 , d2 } and {d2 , d3 }
while both of them are part of either the same or different clusters in dimensions {d1 , d4 } and
{d2 , d4 }. There is no cluster formation in dimensions {d3 , d4 } and both points stay out of
the cluster in dimensions {d1 , d3 }. To identify each of these relationships, we need to find
clusters with respect to particular relevant sets of dimensions. As a subset of dimensions is
called a subspace, these clusters existing in the subspaces of the data are called subspace
clusters. The data points in a subspace cluster are similar to each other in all dimensions
attached to this subspace.
Both of the above concerns of high-dimensional data imply that useful clusters can only be found in lower-dimensional subspaces, and that all possible subspace clusters should be discovered.
1.2 Subspace clustering problem
Subspace clustering is a branch of clustering which endeavours to find all hidden
subspace clusters. There is also an allied branch of clustering algorithms called projected
clustering where a user prescribes the number of subspace clusters to be found and each
data point can belong to at most one cluster [13]. But this is more of a data partitioning
approach than an exhaustive search for hidden subspace clusters.
An important property of subspace clustering is that we do not have prior information
about the data points and dimensions participating in it. Thus, the only possible approach
is to perform an exhaustive search for similar data points in all possible subspaces. Moreover, the number of hidden clusters and the relevant subspaces should be an output rather
than an input of a clustering algorithm. A k-dimensional dataset can have up to 2^k − 1 axis-parallel subspaces. The number of subspaces is exponential in the number of dimensions, e.g. there are 1023 subspaces for a 10-dimensional dataset and about 1.05 million for a 20-dimensional
dataset. The large number of dimensions thus dramatically increases the possibilities of
grouping data points. Thus, the number of subspace clusters can far exceed the data size.
This exponential search makes subspace clustering a complex and challenging task.
Most of the subspace clustering algorithms use a bottom-up search strategy based on the
Apriori principle [14], which also helps to prune the redundant clusters.
1.2.1 Apriori principle
According to the Apriori principle, if a group of points form a cluster C in a d-dimensional
space then C is also a part of some cluster in the lower (d − 1)-dimensional projection
of this space. The downward closure property of this principle implies that cluster C will
be redundantly present in all 2^d − 1 projections of this d-dimensional space. We call this
cluster C, a maximal cluster, which is intuitively a cluster in a subspace of maximum
possible dimensionality; this also means that the cluster ceases to exist if we increase the dimensionality of the subspace even by one. It is not necessary to detect the non-maximal clusters because they can be detected anyway as projections of maximal clusters. However, most algorithms implicitly or explicitly compute these trivial clusters during the
clustering process. The second problem of excessive database scans arises as most algorithms construct clusters from dense units, smaller clusters that are occupied by a sufficient number of points. The database scans are required for determining the occupancy
of the dense units while constructing subspace clusters bottom up; to check whether the
same points occupy the next higher-dimensional dense unit while progressing from a
lower-dimensional dense unit.
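To illustrate the redundancy implied by the downward closure property (a small sketch with made-up dimension indices, not taken from the thesis), the following snippet enumerates the lower-dimensional projections in which a maximal cluster would reappear as a trivial cluster.

```python
from itertools import combinations

def redundant_projections(subspace):
    """All non-empty proper subsets of a maximal cluster's subspace: the lower-dimensional
    projections in which the same set of points would redundantly form a cluster."""
    dims = sorted(subspace)
    return [set(c) for r in range(1, len(dims)) for c in combinations(dims, r)]

# A maximal cluster in subspace {1, 3, 4} reappears in every lower-dimensional
# projection of that subspace: {1}, {3}, {4}, {1,3}, {1,4} and {3,4}.
for proj in redundant_projections({1, 3, 4}):
    print(proj)
```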
Subspace clustering is a very complex and challenging task for high-dimensional data, as the number of subspaces is exponential in the number of dimensions. Most of the subspace clustering algorithms use a bottom-up approach based on the downward closure property of the Apriori principle [15]. In this approach, density-based similarity measures are used to
find the clusters in the lower-dimensional subspaces, starting from 1-dimensional clusters,
which are combined together iteratively to form the clusters in the higher-dimensional
subspaces (Figure 1.3). Although these algorithms can find arbitrary-shaped subspace
clusters, they fail to scale with the number of dimensions. Both the speed and the quality of clustering are of major concern [16].
1.3 Motivating examples
With the emergence of new applications, the area of subspace clustering is of critical
importance. The following are some examples that cannot be handled by traditional clustering algorithms due to their size, dimensionality and focus of interest:
Figure 1.3: Bottom-up clustering. Lower-dimensional clusters are joined with each other
to obtain higher-dimensional clusters.
– In biology, high throughput gene expression data obtained from microarray chips
forms a matrix [17]. Each cell in this matrix contains the expression level of a
gene (row) under an experimental condition (column). The genes which co-express
together under the subsets of experimental conditions are likely to be functionally
similar [18]. One of the interesting characteristics of this data is that both genes
(rows) and experimental conditions (columns) can be clustered for meaningful biological inferences. One of the assumptions in molecular biology is that only a subset
of genes are expressed under a subset of experimental conditions, for a particular
cellular process [19]. Also, a gene or an experimental condition can participate in
more than one cellular process allowing the existence of overlapping gene-clusters
in different subspaces. Cheng and Church [20] were the first to introduce biclustering, an extension of subspace clustering, for microarray datasets. Since
then, many subspace clustering algorithms have been designed for understanding
the cellular processes [21] and gene regulatory networks [22], assisting in disease
diagnosis [23] and thus, better medical treatments. Eren et al. [24] have recently
compared the performance of related subspace clustering algorithms in microarray data and because of the combinatorial nature of the solution space, subspace
clustering is still a challenge in this domain.
– Many computer vision problems are associated with matching images, scenes, or
motion dynamics in video sequences. Image data is very high-dimensional, e.g., a low-end 3.1-megapixel camera can capture a 2048 × 1536 image, i.e. 3,145,728 dimensions. It has been shown that the solutions to these high-dimensional computer
vision problems lie in finding the structures of interest in the lower-dimensional
subspaces [25–27]. As a result, subspace clustering is very important in many computer vision and image processing problems, e.g., recognition of faces and moving
objects. Face recognition is a challenging area as the images of the same object may
look entirely different under different illumination conditions and different images
can look the same under different illumination settings. However, Basri et al. [25]
have proved that all possible illumination conditions can be well approximated by
a 9-dimensional linear subspace, which has further directed the use of subspace
clustering in this area [28, 29]. Motion segmentation involves segregating each of
the moving objects in a video sequence and is very important for robotics, video
surveillance, action recognition etc. Assuming each moving object has its own trajectory in the video, the motion segmentation problem reduces to clustering the
trajectories of each of the objects [30], another subspace clustering problem.
– In online social networks, the detection of communities having similar interests can
aid both sociologists and target marketers [26]. Günnemann et al. [31] have applied
subspace clustering on social network graphs for community detection.
– In radio astronomy, clusters of galaxies can help cosmologists trace the mass distribution of the universe and further the understanding of theories about its origin [32, 33].
– Another important area of subspace clustering is web text mining through document clustering. There are billions of digital documents available today and each
document is a collection of many words or phrases, making it a high-dimensional
application domain. Document clustering is very important these days for efficient
indexing, storage and retrieval of the digital content. Documents can group together
differently under different sets of words. An iterative subspace clustering algorithm
for text mining has been proposed by Li et al. [34].
In all of the applications discussed above, meaningful knowledge is hidden in lower-dimensional subspaces of the data, which can only be explored through subspace clustering techniques. In this thesis, we address this research challenge of finding subspace clusters in high-dimensional data and propose efficient algorithms which are faster and more scalable in dimensions.
1.4 Thesis organisation
The thesis contains 7 chapters.
Chapter 2 surveys background material on the problem of clustering, presenting several existing approaches to cluster data. It discusses the clustering techniques used to
tackle the high dimensionality, starting from trying to reduce the dimensionality to partitioning to subspace clustering.
Chapter 3 explains the foundation of our approach, which is termed SUBSCALE, our
novel algorithm for subspace clustering of high-dimensional data.
Chapter 4 introduces the approaches to make SUBSCALE a scalable algorithm for
bigger datasets both in terms of size and dimensions.
Chapter 5 discusses the parallel approaches to subspace clustering for faster execution.
Chapter 6 illustrates the applications of SUBSCALE in outlier characterisation and
ranking for high-dimensional data. It also presents a case study of using SUBSCALE on
a genes dataset.
Chapter 7 concludes the thesis and presents directions for future research.
Chapter 2
Literature Review
2.1 Introduction
In this chapter, we present the literature related to clustering, in particular, subspace clustering. We focus more on the algorithms related to our solution and discuss their advantages as well as disadvantages. We also discuss the opportunities provided by parallel
processing to increase the efficiency of clustering algorithms.
One of the fundamental endeavours to explore and understand the data is to find those
data points which are either similar or dissimilar. Classification and cluster analysis fall into the category of similarity-based grouping of data, while outlier detection fits into the latter, dissimilarity-based category.
Classification is a supervised approach to group the data into already known classes
or groups. Using a learning algorithm, predictions are made about which data point fits
into which class. A recent survey on the state-of-the-art classification algorithms is presented in [35]. Clustering or cluster analysis is an unsupervised way of grouping similar
data without any prior information about these groups [36]. Although clustering is more
challenging than classification, it helps to discover the hidden clusters which cannot be
known otherwise. The by-products of clustering are called outliers as these are the data
points which do not fit into any group and can provide further insights in the underlying
data [37].
The history of cluster analysis can be traced back to the 1950s, when one of the most popular clustering algorithms, K-means, was developed [38, 39]. The clustering problem has been studied extensively in different disciplines, including statistics [40], machine learning [41],
image processing [26], bioinformatics [42] and data mining [5]. In fact, a search with the
keyword ‘Data clustering’ on Google Scholar [43] found ∼3 million entries in the year
2016. There are a number of surveys available on clustering algorithms along the timeline of their development [44–51].
Clustering algorithms can be broadly divided into two categories: partitioning (section
2.2) and non-partitioning (section 2.3). The partitioning algorithms like K-means [38],
K-medoids, PROCLUS [13] divide the n data points into K clusters using some greedy
approach to optimize the convergence criteria while the non-partitioning algorithms like
DBSCAN [10] and CLIQUE [15] attempt to find all possible clusters without any predefined number of clusters. While clustering, these algorithms use either all of the dimensions together [10] or use the measurements in some [13] or all of the subsets of
dimensions [15].
2.2 Partitioning algorithms
Partitioning algorithms iteratively relocate the data points from one cluster to another until
a convergence criterion is met. These are more of a data relocation technique to divide the data into a fixed number of non-overlapping regions (Figure 2.1).
2.2.1 K-means and variants
K-means is one of the oldest clustering algorithms; it partitions the n data points into K
non-overlapping clusters [38]. The K cluster centroids are initially selected at random
or using some heuristics. The data points are assigned to their nearest centroids using the Euclidean distance. The algorithm then recomputes the centroids of the new distribution of groups, where a centroid is the mean of all the points belonging to that cluster. The data points are iteratively relocated until the algorithm converges.

Figure 2.1: Data partitioning (left: original data; right: partitioned data).

An objective function such as the minimum of the sum of squared errors is commonly used as the convergence criterion of the K-means algorithm. The sum of squared errors over all K clusters, where C_i is the i-th cluster with centroid μ_i, is

\sum_{i=1}^{K} \sum_{P_j \in C_i} ||P_j - \mu_i||^2    (2.1)
The complexity of the K-means algorithm is O(nkKT), where n is the size of the data, k
is the number of dimensions, K is the number of clusters and T is the number of iterations.
Although K-means is very popular because of its simplicity and fast convergence, this
algorithm is very sensitive to outliers, as they can skew the location of the centroids. Other limitations include the selection of the parameter K and the initial centroids, entrapment in local optima, and an inability to deal with clusters of arbitrary shape and size.
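For reference, a minimal K-means sketch is shown below (plain Python with hypothetical helper names; it illustrates the generic algorithm, not an implementation discussed in this thesis). It shows the assign-and-recompute loop and the sum-of-squared-errors objective of Equation 2.1.

```python
import random

def kmeans(points, K, iterations=100):
    """Minimal K-means; points is a list of equal-length tuples of floats."""
    centroids = random.sample(points, K)            # naive random initialisation
    clusters = [[] for _ in range(K)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid (squared Euclidean distance).
        clusters = [[] for _ in range(K)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: recompute each centroid as the mean of its assigned points.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(vals) / len(cluster) for vals in zip(*cluster))
    return centroids, clusters

def sse(centroids, clusters):
    """Sum of squared errors over all clusters, the objective of Equation 2.1."""
    return sum(sum((a - b) ** 2 for a, b in zip(p, c))
               for c, cluster in zip(centroids, clusters) for p in cluster)
```

In practice one would also monitor the change in the SSE between iterations for convergence and rerun with different initial centroids, but the sketch keeps only the core relocation loop.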
There have been many extensions to the K-means algorithm [39, 52]. For example,
the K-medoid or partitioning around medoids (PAM) algorithm [53] uses the median of
the data instead of their mean as centres of the clusters. As the median is less influenced
by the extreme values than the mean, PAM is more resilient in the presence of the outliers. But other limitations remain. The CLARANS algorithm [54] is an improvement
over the K-medoid algorithm and is more effective for large datasets. Random samples
of neighbours are taken from the data and graph-search methods are used to iteratively
obtain optimal K-medoids. However, the quadratic runtime of the CLARANS algorithm
is prohibitive on large datasets.
For high-dimensional data, K-means and its variants are unable to find clusters in the
subspaces.
2.2.2 Projected clustering
PROCLUS (PROjected CLUstering) [13] is a top-down projected clustering algorithm to
find K non-overlapping clusters, each represented by an associated medoid and subspace.
The value of K and the average subspace size are given by the user.
The PROCLUS algorithm randomly chooses a set of K potential medoids on a sample
of points in the beginning. The iterative phase includes finding K good medoids, each
associated with its subspace. The subspace for each of these K medoids is determined
by minimizing the standard deviation of the distances of the points in the neighbourhood
of the medoids to the corresponding medoid along each dimension. The points are reassigned to the medoids considering the closest distance in the relevant subspace of each
medoid. Also, the points which are too far away from the medoids are removed as outliers.
The output is a set of partitions along with the outliers.
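As a rough sketch of the dimension-selection step described above (a simplified illustration, not the exact PROCLUS procedure of [13]), the following snippet scores each dimension of a medoid's neighbourhood by the average deviation from the medoid and keeps the dimensions along which the neighbourhood is tightest.

```python
def select_subspace(medoid, neighbours, subspace_size):
    """Pick the dimensions along which the neighbourhood is tightest around the medoid.

    medoid: tuple of floats; neighbours: list of tuples; subspace_size: dimensions to keep.
    """
    k = len(medoid)
    # Average absolute deviation from the medoid along each dimension.
    deviations = [
        sum(abs(p[d] - medoid[d]) for p in neighbours) / len(neighbours)
        for d in range(k)
    ]
    # The smallest deviations indicate the most relevant dimensions for this medoid's cluster.
    ranked = sorted(range(k), key=lambda d: deviations[d])
    return sorted(ranked[:subspace_size])

# Example: a medoid whose neighbours agree closely on dimensions 0 and 2.
medoid = (1.0, 5.0, 2.0, 7.0)
neighbours = [(1.1, 9.0, 2.1, 3.0), (0.9, 1.0, 1.9, 8.5), (1.0, 6.5, 2.0, 0.5)]
print(select_subspace(medoid, neighbours, subspace_size=2))   # [0, 2]
```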
However, the user has to specify the number of clusters (K) as well as the average size of the subspaces. If the value of K is too small, then the PROCLUS algorithm may miss out on some of the clusters entirely. Also, the PROCLUS algorithm finds clusters in different subspaces but only of roughly the same size, which can miss clusters in subspaces of other sizes.
Additionally, the PROCLUS algorithm is biased toward clusters that are hyper-spherical
in shape.
The ORCLUS (ORiented projected CLUSter generation) [55] algorithm is similar
to the PROCLUS algorithm except that it finds clusters in non-axis parallel subspaces
by selecting principal components for each cluster instead of dimensions. The FINDIT
algorithm [56] is a variant of the PROCLUS algorithm and improves its efficiency and
cluster quality using additional heuristics.
None of these projected clustering algorithms discovers all possible clusters in the
data. Different groups of data can exhibit different clustering tendency under different
subsets of dimensions. Rather than being subspace clustering algorithms, these are essentially space-partitioning algorithms. Any attempt to choose the subspaces or their size
beforehand nullifies the idea of finding all possible unknown correlations among data.
2.3 Non-partitioning algorithms
The non-partitioning clustering algorithms do not depend on the user to input the number of clusters and the relevant subspaces (if any). The aim of a clustering algorithm is to
explore and identify the previously unknown clusters among the data without knowing the
underlying structure. Any attempt to pre-determine the number of clusters or subspaces
before the actual clustering process would dilute the whole idea of clustering.
The non-partitioning algorithms help to identify all possible hidden clusters in the
data without any user bias about the number of clusters or subspaces. These algorithms
are largely based on the density measures of the data and play a pivotal role in finding
arbitrarily shaped clusters. The clusters are dense regions separated by sparse regions, or regions of low density. There are two main categories of such algorithms: one is
based on full-dimensional similarity measures and the other measures similarity among
data points using relevant subset of dimensions.
2.3.1 Full-dimensional based algorithms
DBSCAN
DBSCAN [10] is a full dimensional clustering algorithm and does not need prior information about the number of clusters. According to the DBSCAN algorithm, a point is
dense if it has τ or more points within an ε distance. A cluster is defined as a set of such dense points with intersecting neighbourhoods. The clustering process is based on the following five definitions:

Figure 2.2: Core and border data points in DBSCAN. (In the figure, A and B are core points; C and D are border points.)
Definition 1 (ε-neighbourhood). Given a database DB of n points in k dimensions, the ε-neighbourhood of a point P_i, denoted by N_ε(P_i), is defined as:

N_ε(P_i) = { P_j ∈ DB | dist(P_i, P_j) < ε },  ε ∈ R    (2.2)

where dist() is a similarity function based on the distance between the values of the points.
The previous chapter discusses some of the commonly used distance measures.
The cluster is defined by means of core data points as follows.
Definition 2 (Directly density-reachable). Based on another parameter τ , a cluster has
two kinds of points: core and border (Figure 2.2). If a point has at least τ neighbours in its ε-neighbourhood, it is called a core point, and all of the points in this neighbourhood are said to be directly density-reachable from it. A point is called a border point if it has fewer than τ neighbours in its ε-neighbourhood.
A core point can never be directly density-reachable from a border point but a border
point can be a part of a cluster if it belongs to the ε-neighbourhood of some core point. In Figure 2.2, for example, A and B are core points and C is a border point. C is directly density-reachable from B, but B is not directly density-reachable from C. Direct density-reachability is not symmetric unless both points are core points.
Definition 3 (Density-reachable). A point Py is density-reachable from a point Px if there
is a chain of points P1, ..., Pn such that P1 = Px, Pn = Py, and Pi+1 is directly density-reachable from Pi.
In Figure 2.2, the data point C is density-reachable from point A. A border point is
reachable from a core point but not vice versa. A border point can never be used to reach
other border points which might otherwise belong to the same cluster, for example points
C and D in Figure 2.2. In that case, if they share a common core point from which both
are density-reachable, then they both can be included in the cluster.
Definition 4 (Density-connected). Two points Px , Py are said to be density-connected
with each other, if there is a point Pz such that both Px and Py are density-reachable from
Pz. Both density-reachability and density-connectivity are defined with respect to the same ε and τ parameters.
Definition 5 (Cluster).
A cluster consists of all density-connected points. If a point is density-reachable from a
point in the cluster then that point is included in the cluster as well.
The DBSCAN algorithm starts with an arbitrary point Px and if Px is a core point
then DBSCAN retrieves all density-reachable points and adds them to the cluster. If Px is a border point, then the next point is processed, and so on. The DBSCAN algorithm is not sensitive to outliers and can find clusters of arbitrary sizes and shapes, with a complexity of O(n^2). However, this algorithm uses all of the dimensions to measure the ε-neighbourhood. As the data gets sparsely distributed in high-dimensional space, this
algorithm is unable to report meaningful clusters.
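The following sketch (hypothetical helper names, assuming the full-dimensional Euclidean distance discussed earlier) shows the ε-neighbourhood query of Definition 1 and the core-point test of Definition 2 on which the remaining DBSCAN definitions build.

```python
def eps_neighbourhood(db, i, eps):
    """Indices of all points within distance eps of point i (Definition 1), full-dimensional."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
    return [j for j, q in enumerate(db) if j != i and dist(db[i], q) < eps]

def is_core(db, i, eps, tau):
    """Core-point test (Definition 2): at least tau neighbours within the eps-neighbourhood."""
    return len(eps_neighbourhood(db, i, eps)) >= tau

# Tiny example: three tightly packed points and one distant point.
db = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(is_core(db, 0, eps=0.5, tau=2))   # True: two neighbours within 0.5
print(is_core(db, 3, eps=0.5, tau=2))   # False: the distant point has no neighbours
```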
2.3.2 Subspace clustering
All of the clustering algorithms discussed above are time tested and known to perform
very well for low-dimensional data. However, these algorithms are not suitable for high-dimensional data due to the curse of dimensionality. Also, they fail to give additional information about each cluster, namely the relevant dimensions in which the cluster is
more significant. Thus, it becomes imperative to find clusters hidden in lower-dimensional
subspaces.
Subspace clustering algorithms recursively find nested clusters using a bottom-up approach, starting with 1-dimensional clusters and merging the most similar pairs of clusters
successively to form a cluster hierarchy. A number of subspace clustering algorithms have
been proposed in recent years. Agrawal et al. [15] were the first to introduce their famous CLIQUE algorithm for subspace clustering, which is discussed below. We also discuss other subspace clustering algorithms: FIRES [57], SUBCLU [58], and INSCY [59],
which are largely based on the DBSCAN algorithm [10].
CLIQUE
The CLIQUE (CLustering In QUest) algorithm is based on grid-based computation to discover clusters embedded in subsets of dimensions. The clusters in a k-dimensional space are seen as hyper-rectangular regions of dense points, iteratively built from lower-dimensional hyper-rectangular clusters.
The agglomerative cluster generation process in the CLIQUE algorithm is based on
the Apriori algorithm, which was originally used for frequent item-set mining [14] and is discussed in chapter 1 (section 1.2.1). According to the downward closure property
of the Apriori principle, if a set of points is a cluster in a k-dimensional space then this
set will be part of a cluster in the (k − 1)-dimensional space. The anti-monotonicity
property of this principle helps to drastically reduce the search space for the iterative bottom-up clustering process.
Initially, each single dimension of the data space is partitioned into ξ equal-sized units
using a fixed size grid. A unit is considered dense if the number of points in it exceeds the
density support threshold, τ . Only those units which are dense are retained and others are
discarded. The clustering process involves generation of k-dimensional candidate units
by self-joining those (k − 1)-dimensional units which share the first k − 2 dimensions in common, assuming that the dimensions attached to each dense unit are kept in sorted order. At
each step, the candidate units which are not dense are discarded and the rest are processed
to generate higher dimensional candidate units.
Thus, 1-dimensional base units in k single dimensions are combined using self-join to
form 2-dimensional candidate units and out of these 2-dimensional units, non-dense units
are discarded and the rest are combined to form 3-dimensional candidate units and so on.
Finally, in each k-dimensional subspace, the clusters are formed by computing the disjoint sets of
connected k-dimensional units. At the end of this recursive clustering process, we have
a set of clusters in their highest possible subspaces. These clusters can lie in the same,
overlapping or disjoint subspaces.
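To make the self-join step concrete, the following sketch (a simplified illustration of the Apriori-style join, not the actual CLIQUE implementation) generates k-dimensional candidate units from (k − 1)-dimensional dense units that share their first k − 2 dimensions; a unit is represented here simply as a sorted tuple of (dimension, grid-interval) pairs.

```python
def self_join(dense_units):
    """Join (k-1)-dimensional dense units sharing their first k-2 dimensions.

    Each unit is a tuple of (dimension, grid_interval) pairs, sorted by dimension.
    Returns candidate k-dimensional units; candidates must still pass the density test.
    """
    candidates = set()
    units = sorted(dense_units)
    for i, u in enumerate(units):
        for v in units[i + 1:]:
            # Units must agree on all but their last dimension, and the joined
            # dimensions must differ so the result spans one extra dimension.
            if u[:-1] == v[:-1] and u[-1][0] < v[-1][0]:
                candidates.add(u + (v[-1],))
    return candidates

# Example: dense units in 1-dimensional subspaces {1}, {2}, {3} (grid cell 4 in each).
one_dim = {((1, 4),), ((2, 4),), ((3, 4),)}
two_dim = self_join(one_dim)          # candidates in subspaces {1,2}, {1,3}, {2,3}
print(sorted(two_dim))
```

The generated candidates still have to be checked against the density threshold τ before they are used in the next round of joins.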
The CLIQUE algorithm is insensitive to outliers and can find arbitrarily shaped
clusters of varying sizes. Most importantly, for each cluster, additional information about
the relevant subset of dimensions is also given. The time complexity of the CLIQUE
algorithm is O(c^p + pn), where c is a constant, p is the dimensionality of the highest subspace found, and n is the number of input data points. The complexity grows exponentially with
dimensions.
The main inefficiency of the CLIQUE algorithm comes from the generation of a large number of redundant dense units during the process. There is no escape from the computation of
these redundant units, as they have to be generated in each of the 1st-, 2nd-, ..., (k − 1)th-dimensional subspaces before a maximal cluster in a k-dimensional subspace is found. The
maximal subspace clusters were introduced in section 1.2.1. Although these dense units
are pruned as the algorithm progresses to higher dimensions, it is the first few lower-dimensional subspaces which generate the larger share of these dense units. For example, a k-dimensional dataset has k(k − 1)/2 2-dimensional subspaces. As each dimension is divided into ξ units, each 2-dimensional subspace has ξ² candidate units to check through the self-join. In total, there are on the order of k(k − 1)/2 × ξ² units to be self-joined. The self-join further adds
to the time complexity by comparing and checking each and every point in the adjacent
units.
The computational expense of generating and combining dense units at each stage of
the recursive process causes the CLIQUE algorithm to break down for high-dimensional
data.
CLIQUE extensions
The MAFIA (Merging of Adaptive Finite IntervAls) [60] algorithm proposed improvements over the CLIQUE algorithm through better cluster quality and efficiency. It introduced adaptive grids which are semi-automatically built based on the data distribution, and it uses the same bottom-up cluster generation process starting from one dimension. Although MAFIA yields up to two orders of magnitude speed-up compared to CLIQUE,
the execution time of MAFIA grows exponentially with the dimensionality of data.
ENCLUS (ENtropy based CLUStering) [61] is another algorithm similar to the CLIQUE
algorithm but uses the concept of entropy from information theory to find the relevant
subspaces for clustering. The underlying premise is that a uniform distribution of data
will have a higher entropy than a skewed data distribution. Therefore, the entropy of
subspaces having regions of dense units will be low. Based on an entropy threshold, subspaces are selected for clustering. Entropy also helps to prune the subspaces, similarly to the downward closure property of the Apriori principle: if a k-dimensional subspace has low entropy, then its (k − 1)-dimensional subspaces will also have low entropy.
The benefit of using entropy is that the ENCLUS algorithm can find extremely dense
and small clusters which would otherwise be ignored by the CLIQUE algorithm. Yet, the additional cost of computing the entropy of each and every subspace makes this algorithm infeasible
for high-dimensional data.
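As an illustration of the entropy criterion (a rough sketch assuming values normalised to [0, 1) and a fixed grid per subspace, not the actual ENCLUS procedure), the following snippet computes the entropy of the grid-cell occupancy of points projected onto a candidate subspace; a lower entropy indicates a more clustered, and hence more interesting, subspace.

```python
import random
from collections import Counter
from math import floor, log2

def subspace_entropy(points, subspace, xi=10):
    """Entropy of the grid-cell occupancy of points projected onto the given dimensions.

    points: tuples with values in [0, 1); subspace: dimension indices; xi: intervals per dimension.
    """
    cells = Counter(tuple(floor(p[d] * xi) for d in subspace) for p in points)
    n = len(points)
    return -sum((c / n) * log2(c / n) for c in cells.values())

# Points clustered in dimensions (0, 1) but uniform in dimension 2: the clustered
# subspace has markedly lower entropy than the uniform one.
pts = [(random.uniform(0.4, 0.5), random.uniform(0.4, 0.5), random.random()) for _ in range(200)]
print(subspace_entropy(pts, (0, 1)))   # close to 0: all points fall into one cell
print(subspace_entropy(pts, (2,)))     # close to log2(10): points spread across cells
```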
SUBCLU
The SUBCLU [58] algorithm relies on DBSCAN to detect clusters in each of the subspaces. Similar to the previous bottom-up clustering approaches, it uses the Apriori principle to prune the subspaces, but it also generates all lower-dimensional trivial clusters.
FIRES
Kriegel et al. proposed FIRES (FIlter REfinement Subspace clustering) [57] which is
a hybrid algorithm to find approximate subspace clusters directly from 1-dimensional
clusters. Although it uses a bottom-up search strategy to find maximal cluster approximations, it does not follow the step-by-step Apriori style. The FIRES algorithm consists of three phases: pre-clustering, generation of subspace cluster approximations, and post-processing of subspace clusters.
During the preprocessing phase, FIRES computes 1-dimensional clusters called base
clusters and any clustering technique like DBSCAN, K-means or others can be used
to generate these base clusters. The smaller clusters are discarded in this phase. In the
second phase, the ‘promising’ candidates from the 1-dimensional base clusters are chosen
based on the similarity among them. FIRES defines similarity of clusters by the number
of intersecting points and heuristics are used to select the most similar base clusters. The
resulting clusters represent hyper-rectangular approximations of the subspace clusters. In
the post-processing step, the structures of these approximations are further refined.
FIRES does not employ an exhaustive search procedure to find all possible subspace clusters and, therefore, outperforms SUBCLU and CLIQUE in terms of scalability and runtime with respect to the data dimensionality. However, this performance boost comes at the cost of clustering accuracy: FIRES does not discover all of the hidden subspace clusters and only gives heuristic approximations of subspace clusters, which may or may not overlap.
INSCY
Assent et al. proposed the INSCY algorithm [59] for subspace clustering, which is an extension of the SUBCLU algorithm. They use a special index structure called a SCY-tree, which can be traversed in depth-first order to generate high-dimensional clusters. Their algorithm compares each data point of the base clusters and enumerates them implicitly in order to merge the base clusters for generating the higher-dimensional clusters.
The search for the maximal subspace clusters by the INSCY algorithm is quite exhaustive, as it implicitly generates all intermediate trivial clusters during the bottom-up clustering process. The complexity of the INSCY algorithm is O(2^k |DB|^2), where k is the dimensionality of the maximal subspace cluster and |DB| denotes the size of the dataset. Also, Muller et al. [62] proposed an approach for subspace clustering which reduces the exponential search space while generating intermediate clusters through selective jumps. But again, their algorithm depends upon counting the points across candidate hyper-rectangles to determine their similarity and preference.
2.4
Desirable properties of subspace clustering
We have identified the following desirable properties which should be satisfied by a subspace clustering algorithm for a k-dimensional dataset of n points:
1. The groupings among data points vary under different subsets of dimensions. Although the clusters within the same subspace are disjoint, the clusters from different subspaces can be partially overlapping and share some of the data points among them. Therefore, a subspace clustering algorithm should extract all possible clusters in which a data point participates. For example, if a cluster C in a subspace {1, 3, 4} contains points {P3, P6, P7, P8} and another cluster C′ in a subspace {1, 3, 6} contains points {P1, P3, P4, P6}, both of the clusters C and C′ should be detected. Note that both points P3 and P6 participate together in two different clusters in different subspaces.
2. The subspace clustering algorithm should give only non-redundant information, that is, if all the points in a cluster C are also present in a cluster C′ and the subspace in which C exists is a subset of the subspace in which the cluster C′ exists, then the cluster C should not be included in the result, as the cluster C does not give any additional information and is a trivial cluster.
A strong conformity to this criterion would be that such redundant lower-dimensional clusters are not generated at all, as their generation and later pruning leads to higher computational cost. In other words, the subspace clustering algorithm should output only the maximal subspace clusters. As discussed earlier, a cluster is in a maximal subspace if there is no other cluster which conveys the same grouping information between the points as already given by this cluster. The cluster C′ is thus a maximal cluster, while the cluster C is a non-maximal or trivial cluster.
The K-means based partitioning algorithms are meant to find only a predefined number of clusters using the full-dimensional distance among the data points. The clusters existing in the subspaces of high-dimensional data cannot be discovered using these techniques. Therefore, neither of the desirable criteria for efficient subspace clustering can be satisfied by these algorithms. The projected clustering algorithms like PROCLUS can find clusters in subspaces but fail to detect all maximal clusters, and so do not conform to the 2nd criterion of the desirable properties described above. Neither do these algorithms satisfy the 1st criterion, as only a user-defined number of clusters is detected.
The non-partitioning clustering algorithms like DBSCAN, which are based on the full-dimensional space, do not fall under the category of subspace clustering, and thus both criteria on desirable properties can be skipped from the discussion. The hierarchical-clustering based algorithms like CLIQUE and SUBCLU satisfy the 1st criterion of the desired subspace clustering algorithm and can find all of the arbitrarily shaped clusters, but they fail to satisfy the 2nd criterion as they still generate many trivial clusters. The INSCY algorithm too cannot strongly conform to the 2nd criterion of the desired subspace clustering algorithm. The FIRES algorithm fails to satisfy both of the criteria, as it does not output all possible clusters and also generates redundant clusters along the way.
There is no doubt that subspace clustering is an expensive process. Due to the numerous applications of subspace clustering discussed in the previous chapter, there is an urgent need for efficient solutions to the subspace clustering problem. Exploring all of the subspaces for possible clusters is a challenge. The need for enumerating points in O(2^k) subspaces using a multi-dimensional index structure introduces computational cost as well as inefficiency. All of the subspace clustering algorithms discussed so far suffer from the lack of efficient indexing structures for the enumeration of points in the multi-dimensional subspaces and also require multiple database scans. The generation of trivial clusters adds to the complexity.
The optimal solution to the subspace clustering problem is to generate only maximal clusters with minimal database scans. In the next chapter, we overcome the limitations of existing clustering algorithms by proposing a novel approach to efficiently find all possible maximal subspace clusters in high-dimensional data. Our approach fully conforms to both of the desirable criteria of a true subspace clustering algorithm.
Chapter 3
A novel fast subspace clustering algorithm
3.1
Introduction
High-dimensional data poses its own unique challenges for clustering. The baseline fact behind these challenges is that the data groups together differently under different subsets of dimensions. For better insight into the underlying data, it is important to know the relevant dimensions associated with each group of similar points, called a cluster. Subspace clustering algorithms are the key to discovering such inter-relationships between clusters and the subsets of dimensions called subspaces.
As we do not have any prior information about the hidden clusters and the relevant subsets of dimensions, an exhaustive search of all subspaces seems necessary. Subspace clustering through a bottom-up hierarchical process promises to find all possible subspace clusters. We have discussed some of these state-of-the-art algorithms in chapter 2.
However, the exponential increase in the number of subspaces with the dimensions makes the subspace clustering process extremely expensive. As we discussed in chapter 2, there are some pruning techniques employed by various subspace clustering algorithms to reduce this search space, but redundant clusters are still generated at each stage of the hierarchical process.
Table 3.1: Marks dataset

Student id    mathematics    science    arts
S1            10             8          2
S2            9.6            7.6        8
S3            4              7.8        2.2
S4            1.6            7.7        2.3
S5            1.5            9          5.2
Although these clusters are eliminated later on, their generation during the clustering process adds to the computational expense. Also, the merging of dense units using self-joins and other point-wise matching and comparison techniques brings in further inefficiency.
In this chapter, we present our novel solution to the subspace clustering problem for high-dimensional data. Before explaining our approach, we revisit the subspace clustering problem using examples.
3.1.1
Exponential search space
Table 3.1 shows a dummy Marks dataset of 5 students consisting of their examination marks measured over three subjects (dimensions): mathematics, science and arts. It might be interesting to find which groups of students perform similarly in which of the exams. Two students might perform similarly in mathematics and science but not in arts. Some other students might score similar marks in all three of mathematics, science and arts. If we assume a similarity distance of 0.5, then, as shown in Table 3.2, there is one cluster each in the subspaces {mathematics, science} and {science, arts}. Also, no two students have similar marks within the range of 0.5 distance in the subspace {mathematics, arts}.
With just three attributes in the above example, there are 2^3 − 1 = 7 possible ways to decipher the relevant subspaces of similar points. As the number of dimensions grows from three to hundreds or thousands or higher, there is an exponential growth in the number of possible
Table 3.2: Clusters in the Marks dataset

subspace                        clusters
{mathematics}                   {S1, S2} and {S4, S5}
{science}                       {S1, S2, S3, S4}
{arts}                          {S1, S3, S4}
{mathematics, science}          {S1, S2}
{mathematics, arts}             nil
{science, arts}                 {S1, S3, S4}
{mathematics, science, arts}    nil
subspaces which can contain clusters. For efficient clustering, it is important to reduce
this search space without any information loss.
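To make the example concrete, here is a minimal sketch that recovers the 1-D groupings of Table 3.2 by chaining sorted marks whose consecutive gaps are within the assumed similarity distance of 0.5; this is a simplified notion of 1-D similarity for illustration, not the clustering procedure used later in the thesis.

marks = {
    "mathematics": {"S1": 10, "S2": 9.6, "S3": 4, "S4": 1.6, "S5": 1.5},
    "science":     {"S1": 8, "S2": 7.6, "S3": 7.8, "S4": 7.7, "S5": 9},
    "arts":        {"S1": 2, "S2": 8, "S3": 2.2, "S4": 2.3, "S5": 5.2},
}

def one_d_groups(values: dict, eps: float = 0.5):
    """Group students whose sorted marks form a chain of gaps <= eps."""
    ordered = sorted(values.items(), key=lambda kv: kv[1])
    groups, current = [], [ordered[0][0]]
    for (_, prev_v), (student, v) in zip(ordered, ordered[1:]):
        if v - prev_v <= eps:
            current.append(student)
        else:
            groups.append(current)
            current = [student]
    groups.append(current)
    return [g for g in groups if len(g) > 1]

for subject, values in marks.items():
    print(subject, one_d_groups(values))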
3.1.2
Redundant clusters
We note in Table 3.2 that a student can participate in more than one cluster in different subspaces (for example, student S1). Thus, the clusters from different subspaces can overlap. The overlapping of clusters can also happen within the same subspace, but such clusters can be connected together through common points to get maximal coverage, as proposed in the CLIQUE or DBSCAN algorithms. For example, if students S1 and S2 received similar marks in mathematics and S1 and S3 also received similar marks in mathematics, then we can say that all three students S1, S2 and S3 scored similar marks in mathematics.
Two overlapping clusters from different subspaces represent relationships among points under different circumstances. Sometimes it might not be feasible to combine two overlapping clusters from different subspaces. For example, if students {S1, S2} score similarly in mathematics and students {S1, S3} score similarly in science, then it cannot be predicted that S3 also scored similarly to S1 in mathematics or that S2 scored similarly to S1 in science. But if the attached subspaces of two clusters form a hierarchical relationship, then there are situations when one of them can be eliminated. For example, in Table 3.2 there are two disjoint groups of students who score similarly in mathematics: {S1, S2} and {S4, S5}. The group {S1, S2} also scores similarly in the subspace {mathematics, science}. As the group {S1, S2} is redundantly present in both subspaces, one of them can be discarded.
The number of redundant clusters grows tremendously with the increase in the number of subspaces. Pruning techniques like the Apriori algorithm help to reduce the number of such overlapping clusters by eliminating the ones present in lower-dimensional subsets of a subspace.
3.1.3
Pruning and redundancy
According to the anti-monotonicity property of the Apriori principle, if a set of points forms a cluster in a k-dimensional subspace S, then it will be part of a cluster in every (k − 1)-dimensional subspace S′ such that S′ ⊂ S. Thus, a cluster from a higher-dimensional subspace S will be projected as a cluster in all of its 2^k − 1 lower-dimensional subspaces. Considering Figure 3.1, suppose there is a cluster C present in a 3-dimensional subspace {1, 3, 4}; then it will also be present in all the subsets of this subspace: {1, 3}, {1, 4}, {3, 4}, {1}, {3} and {4}. Let there be no superset of the subspace {1, 3, 4} which contains cluster C, which means {1, 3, 4} is the maximal subspace for C. The cluster C in the maximal subspace {1, 3, 4} gives the same grouping information as provided by the combined subspaces in the subsets of {1, 3, 4}. It is therefore sufficient to find clusters in only their maximal subspaces. It is not necessary to generate non-maximal clusters because they are trivial, but most of the algorithms implicitly or explicitly compute them.
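For instance, the following purely illustrative snippet enumerates the lower-dimensional projections in which a cluster found in the maximal subspace {1, 3, 4} would redundantly reappear under this monotonicity property.

from itertools import combinations

maximal_subspace = (1, 3, 4)
projections = [
    subset
    for r in range(1, len(maximal_subspace))          # proper, non-empty subsets only
    for subset in combinations(maximal_subspace, r)
]
print(projections)   # [(1,), (3,), (4,), (1, 3), (1, 4), (3, 4)]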
The subspace clustering algorithms based on the bottom-up approach find the maximal subspace clusters using a step-by-step hierarchical cluster-building process. Starting from the 1-dimensional subspaces, the clusters in the (k − 1)-dimensional subspaces are combined to generate candidate clusters in the k-dimensional subspaces. The non-dense candidates in the k-dimensional subspaces are then discarded, and the remaining dense clusters are again combined to find (k + 1)-dimensional candidate clusters, and so on. The (k − 1)-dimensional clusters can only be eliminated after the k-dimensional clusters have been found, and not before that. A large number of these redundant clusters are generated for higher dimensions and will already have added to the runtime cost before their actual elimination starts.
[Figure: bottom-up, Apriori-style generation of clusters across subspace levels.]
Figure 3.1: Typical iterative bottom-up generation of clusters based on the Apriori principle. Dense points in 1-dimensional subspaces are combined to compute two-dimensional clusters, which are then combined to compute three-dimensional clusters, and so on.
3.1.4
Multiple database scans and inter-cluster comparisons
In addition to the mandatory detection of the redundant non-maximal clusters, another inherent problem of step-by-step bottom-up clustering algorithms is multiple database scans. An initial database scan is required to generate the 1-dimensional dense clusters. Then, while generating the k-dimensional clusters, another database scan is required at each stage to check the occupancy of each candidate and eliminate the non-dense candidates. Along with these database scans, another inefficiency comes from the need to compare each (k − 1)-dimensional cluster with all of the other (k − 1)-dimensional clusters during the merging phase. The comparison between two sets of data points checks each and every point in both clusters and merges them accordingly. The number of clusters in the merged pool is much larger in lower-dimensional subspaces than in higher-dimensional ones.
A k-dimensional subspace has 2^k lower-dimensional subspaces, and the clusters in each of these subspaces need to be compared with each other to generate the next higher-dimensional candidates. The repeated occupancy checks for density and the large number of inter-cluster comparisons increase both the time and space complexity. The inefficiency increases drastically with the increase in dimensions.
In this chapter, we present a novel algorithm called SUBSCALE which tackles all of
the challenges faced by the current subspace clustering algorithms much more efficiently.
Our algorithm eliminates the need to generate and process redundant subspace clusters,
does not require multiple database scans, and above all provides a new technique to compare dense sets of points across dimensions. The SUBSCALE algorithm is far more scalable with the number of dimensions compared to the existing algorithms and is explained in detail in the next section.
3.2
Research design and methodology
Continuing with the monotonicity of the Apriori principle, a set of dense points in a k-dimensional space S is dense in all lower-dimensional projections of this space [15]. In other words, if we have the dense sets of points in each of the 1-dimensional projections of the attribute-set of a given dataset, then the sufficiently common points among these 1-dimensional sets will lead us to the dense points in higher-dimensional subspaces.
Based on this premise, we develop our algorithm to efficiently find the maximal clusters in all possible subspaces of a high-dimensional dataset. Before we explain our novel
idea to find subspace clusters, we would like to formally define the problem space.
3.2.1
Definitions and problem
Let DB = {P1, P2, . . . , Pn} be a database of n points in a k-dimensional space. The k dimensions are represented by an attribute-set A : {d1, d2, . . . , dk}. Each point Pi in the database DB is a k-dimensional vector {P_i^1, P_i^2, . . . , P_i^k} such that P_i^d, 1 ≤ d ≤ k, denotes the value measured for the point Pi in the dth dimension. P_i^d is also called the projection of the point Pi in the dth dimension. The database DB can also be seen as an n × k matrix.
Subspace
A subspace S is a subset of the dimensions from the attribute-set A : {d1, d2, . . . , dk}. For example, S : {dr, ds} is a 2-dimensional subspace consisting of the dimensions dr and ds, and the projection of a point Pi in this subspace is {P_i^{dr}, P_i^{ds}}.
For the sake of simplicity, we will use only the subscript to denote a dimension; therefore, the subspace S in this case becomes {r, s}. Also, we use the term ‘c-D’ to represent any c-dimensional point or group of points; for example, 2-D means a two-dimensional point or group of points.
The dimensionality of a subspace refers to the total number of dimensions in it. A single dimension can be referred to as a 1-dimensional or 1-D subspace. A subspace with dimensionality a is a higher-dimensional subspace compared to another subspace with dimensionality b, if a > b. Also, a subspace S′ with dimensionality b is a projection of another subspace S of dimensionality a, if a > b and S′ ⊂ S, that is, all the dimensions participating in S′ are also contained in the subspace S.
Density concepts
We adopt the definition of density from DBSCAN [10], which is based on two user-defined parameters, ε and τ, such that a point is dense if it has at least τ points within its neighbourhood of ε distance. The connectivity among the dense points is used to identify arbitrarily shaped clusters.
We refer to section 2.3.1 in chapter 2 for the formal definitions of density-based connectivity between points. These dense points can be easily connected to form a subspace cluster.
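A minimal sketch of this density test for 1-D projections (using the L1 distance, with illustrative ε and τ values) is shown below.

def is_dense(value, values, eps, tau):
    """A 1-D projection is dense if at least tau other projections lie within eps of it."""
    neighbours = sum(1 for v in values if abs(v - value) <= eps) - 1   # exclude the point itself
    return neighbours >= tau

projections = [0.10, 0.11, 0.12, 0.13, 0.55, 0.90]
print([p for p in projections if is_dense(p, projections, eps=0.05, tau=3)])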
Definition 6 (Maximal subspace cluster).
A subspace cluster C = (P, S) is a group of dense points P in a subspace S, such that ∀Pi, Pj ∈ P, Pi and Pj are density-connected with each other in the subspace S with respect to ε and τ, and there is no point Pr ∉ P such that Pr is density-reachable from some Pq ∈ P in the subspace S. A cluster Ci = (P, S) is called a maximal subspace cluster if there is no other cluster Cj = (P, S′) such that S′ ⊃ S.
The maximality of a particular subspace is always relative to a cluster. A subspace which is maximal for a certain group of points might not be maximal for another group of points. For example, a cluster C1 : {P1, P2, P3, P4} might exist in a subspace S : {d1, d2, d4, d6} such that there is no superset of S which contains C1. Thus, S is a maximal subspace for C1. Another cluster C2 : {P1, P5, P6, P7} might exist in a maximal subspace S′ : {d1, d2}. Here, the subspace S′ is maximal for the cluster C2. Although the subspace S′ is a subset of the subspace S, both subspaces are relevant and maximal for different clusters.
Some of the related literature treats the maximality of clusters in terms of the inclusion of all possible density-connected points (in a given subspace) into one cluster. We call this an inclusive property of a clustering algorithm. Our ‘maximal subspace clusters’ are both inclusive (with respect to the points in a given subspace) and maximal (with respect to lower-dimensional projections).
In the next subsection, we explain the main ideas underlying the SUBSCALE algorithm.
3.2.2
Basic idea
Consider the example given in Figure 3.2: the two-dimensional Cluster5 is an intersection of the 1-D projections of the points in Cluster1 and Cluster2. Also, we note that the projections of the points {P7, P8, P9, P10} on the d2-axis form a 1-D cluster (Cluster3), but there is no 1-D cluster in the dimension d1 which has the equivalent points in it, which justifies the absence of a two-dimensional cluster containing these points in the subspace {d1, d2}.
[Figure: scatter plot of the points P1–P14 over dimension d1 (horizontal axis) and dimension d2 (vertical axis).]
Figure 3.2: Basic idea behind the SUBSCALE algorithm. The projections of the points {P7, P8, P9, P10} on the d2-axis form a 1-D cluster, that is, Cluster3, but no 1-D cluster in dimension d1 has the same points as Cluster3; hence there is no corresponding 2-dimensional cluster containing these points in the subspace {d1, d2}.
Given an m-dimensional cluster C = (P, S) where S = {d1, d2, . . . , dm}, the projections of the points in P are dense points in each of the single dimensions {d1}, {d2}, . . . , {dm}. It implies that if a point is not dense in a 1-dimensional space, then it will not participate in the formation of clusters in higher subspaces containing that dimension. Thus, we can combine the 1-D dense points in m different dimensions to find the density-connected sets of points in the maximal subspace S. Recall that a point is dense if it has at least τ neighbours in its ε-neighbourhood with respect to a distance function dist(). In 1-dimensional subspaces, the L1 metric can be safely used as the distance function to find the dense points.
Observation 1. If at least τ +1 density-connected points from a dimension di also exist as
density-connected points in the single dimensions dj , . . . , dr , then these points will form
a set of dense points in the maximal subspace, S = {di , dj , . . . dr }.
To illustrate further, let there be four clusters, red, green, blue and purple, in higher-dimensional subspaces. These four clusters can be in the same subspace or may be in different subspaces of different dimensionality. Assume these subspaces are maximal for
these clusters and contain at least three dimensions di, dj and dk. As discussed before, all three of these dimensions will contain the projections of these four clusters, as shown in Figure 3.3.
[Figure: the 1-D projections onto dimensions di, dj and dk, each containing the dense projections of the four clusters.]
Figure 3.3: Projections of clusters in a high-dimensional subspace are dense across the participating dimensions.
An important observation in Figure 3.3 is that the dense projections in the single dimensions can be intermixed with other neighbouring dense points. In dimension dk, for example, the points of the green cluster are mixed with points from another, pink, cluster, and the pink cluster does not exist in the other two dimensions. The challenge is how to connect these dense points from different 1-dimensional spaces to form a maximal subspace cluster.
The naive way is to first find the density-connected points in each dimension and then find the intersections of all of the density-connected points in all of the single dimensions. Each density-connected set can have a different number of points in it, and there can be a different number of density-connected sets in each dimension. Comparing each and every point across dimensions is not an efficient way for high-dimensional data. Another approach is to divide these density-connected points into smaller units and, instead of comparing each and every point of the density-connected sets, simply compare these units across dimensions to check whether they contain identical points, as shown in Figure 3.4.
Following Definition 3, each point in a subspace cluster will belong to the ε-neighbourhood of at least one dense point. Therefore, the smallest possible projection of a cluster from a higher-dimensional subspace is of cardinality τ + 1; let us call it a dense unit, U. If U_1^{di} and U_2^{dj} are two 1-D dense units from the single dimensions di and dj respectively, we say that they are the same dense unit if they contain the same points, that is, U_1^{di} = U_2^{dj} if ∀Pi [Pi ∈ U_1^{di} ↔ Pi ∈ U_2^{dj}].
[Figure: dense units in dimensions di, dj and dk being matched across dimensions.]
Figure 3.4: Matching dense units across dimensions.
Observation 2. Following observation 1, if the same dense unit U exists across m single dimensions, then U exists in the maximal subspace spanned by these m dimensions.
In order to check whether two dense units are the same, we propose a novel idea of assigning signatures to each of these 1-D dense units. The rationale behind this is to avoid comparing the individual points among all dense units in order to decide whether they contain exactly the same points. We can hash the signatures of these 1-D dense units from all k dimensions, and the resulting collisions will lead us to the maximal subspace dense units (Observation 2).
Our proposal for assigning signatures to the dense units is inspired by the work in
number theory by Erdös and Lehner [63] which we explain in detail below.
3.2.3
Assigning signatures to dense units
If L ≥ 1 is a positive integer, then a set {a1, a2, . . . , aδ} is called its partition, such that L = a1 + a2 + · · · + aδ for some δ ≥ 1, and each ai > 0 is called a summand. Also, let pδ(L) be the total number of such partitions, when each partition has at most δ summands.
Erdös and Lehner [63] studied these integer partitions by probabilistic methods and gave an asymptotic formula for δ = o(L^{1/3}):

\[ p_\delta(L) \sim \frac{1}{\delta!}\binom{L-1}{\delta-1} \qquad (3.1) \]
Observation 3. Assume K is a set of random large integers with δ ≪ |K| ≪ pδ(L). Let U1 and U2 be two sets of integers drawn from K such that |U1| = |U2| = δ and δ = o(L^{1/3}). Let us denote the sums of the integers in these two sets as sum(U1) and sum(U2) respectively. We observe that if sum(U1) = sum(U2) = L, then U1 and U2 are the same with an extremely high probability, provided L is very large.
Proof. From Equation 3.1, for a very large positive integer L and a relatively very small partition size δ, the number of unique fixed-size partitions is astronomically large. The probability of getting a particular partition set of size δ is:

\[ \frac{1}{\delta!}\binom{L-1}{\delta-1} \Big/ \binom{L}{\delta} \;=\; \frac{(L-1)!\,\delta!\,(L-\delta)!}{(\delta-1)!\,(L-\delta)!\,\delta!\,L!} \;=\; \frac{1}{L\,(\delta-1)!} \qquad (3.2) \]
It means the probability of randomly choosing the same partition again is extremely
low. And this probability can be made very small by choosing a large value of L and
relatively very small δ. Since L is the sum of the labels of δ points in a dense unit U, L
can be made very large if we choose very large integers as the individual labels. Thus,
with δ = τ + 1, the two dense units U1 and U2 will contain the same points with very high
probability, if sum(U1 ) = sum(U2 ), provided this sum is very large.
We randomly generate a set K of n large integers and use a one-to-one mapping M : DB → K to assign a unique label to each point in the database. The signature Sig of a dense unit U is given by the sum of the labels of the points in it. Thus, relying on observation 3, we can match these 1-D signatures across different dimensions without checking the individual points contained in these dense units. For example, if U_1^{d1}, U_2^{d2}, . . . , U_m^{dm} are m dense units in m different single dimensions, with their points already mapped to the large integers, we can hash their signature sums into a hash table. If all the sums collide, then these dense units are the same (with very high probability) and exist in the subspace {d1, d2, . . . , dm}. Thus, the final collisions after hashing all dense units in all dimensions generate the dense units in the relevant maximal subspaces. We can combine these dense units to get the final clusters in their respective subspaces.
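The sketch below illustrates the mechanics with made-up point identifiers, dimension names and dense units: each point receives a large random label, every 1-D dense unit is reduced to the sum of its labels, and matching sums across dimensions reveal the maximal subspace of that dense unit. It is an illustration of the idea, not the thesis implementation.

import random
from collections import defaultdict

random.seed(42)
points = [f"P{i}" for i in range(1, 11)]
# One-to-one mapping M from points to large random integer labels.
labels = {p: random.randrange(10**11, 10**12) for p in points}

# Hypothetical 1-D dense units already found in three single dimensions.
dense_units = {
    "d1": [{"P1", "P2", "P3", "P4"}, {"P6", "P7", "P8", "P9"}],
    "d2": [{"P1", "P2", "P3", "P4"}],
    "d3": [{"P1", "P2", "P3", "P4"}, {"P5", "P7", "P8", "P9"}],
}

# Hash the signature (sum of labels) of every dense unit; colliding sums
# indicate, with very high probability, the same unit across dimensions.
table = defaultdict(lambda: {"unit": None, "subspace": []})
for dim, units in dense_units.items():
    for unit in units:
        sig = sum(labels[p] for p in unit)
        table[sig]["unit"] = unit
        table[sig]["subspace"].append(dim)

for entry in table.values():
    if len(entry["subspace"]) > 1:
        print(sorted(entry["unit"]), "is dense in subspace", entry["subspace"])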
[Figure: number of collisions (log scale) versus the number of digits used for labels (0–12), with one curve for each trial count from 10^2 to 10^7 trials.]
Figure 3.5: A trial includes drawing a label set of 4 random integers with the same number of digits; e.g., {333, 444, 555, 666} is a sample set where the number of digits is 3. The probability of drawing the same set of integers reduces drastically with the use of larger integers as labels, e.g., ≈ 10 collisions over 100 million trials when the label is a 12-digit integer.
Experiments with large integers
We did a few numerical experiments to validate observation 3 and the results are shown
in Figure 3.5. A trial consists of randomly drawing a set of 4 labels from a given range
of integers, e.g., while using 6-digit integers as labels, we have a range between 100000
and 999999 to choose from. All labels in a given set in a given trial use the same integer range. Each time we fill a set with random integers, we store its sum in a common hash table (separate for each integer range). A collision occurs when the sum of a randomly drawn set is found to be the same as an already existing sum in the hash table. We note that the number of collisions is indeed very small when large integers are used as labels; e.g., there are about 10 collisions (a negligible number) for 100 million trials when the
label is a 12-digit integer.
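A scaled-down version of this experiment can be reproduced with the sketch below; the trial counts and digit ranges are reduced so that it runs quickly, and repeated labels within a set are allowed for simplicity, so the absolute numbers differ from Figure 3.5 while the trend is the same.

import random

def count_collisions(trials: int, digits: int, set_size: int = 4) -> int:
    """Draw `trials` random label sets and count how often a sum repeats."""
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    seen, collisions = set(), 0
    for _ in range(trials):
        s = sum(random.randint(lo, hi) for _ in range(set_size))
        if s in seen:
            collisions += 1
        else:
            seen.add(s)
    return collisions

random.seed(1)
for digits in (3, 6, 9):
    print(digits, count_collisions(trials=100_000, digits=digits))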
Also, we observed the probability of collisions by experimenting with different cardinalities of such partition sets drawn at random from a given integer range. We note in Figure 3.6 that the probability of collisions decreases further for larger set sizes. The gap between the numbers of collisions widens for the larger integer ranges.
[Figure: number of collisions (log scale) versus the number of digits used for labels, with one curve for each set cardinality |Set| ∈ {3, 10, 20, 30, 50}.]
Figure 3.6: Numerical experiments for the probability of collisions with respect to the number of random integers drawn at a time. |Set| denotes the cardinality of the set drawn. The number of trials for each fixed integer set is 1,000,000.
3.2.4
Interleaved dense units
Although matching 1-D dense units across 1-dimensional subspaces is a promising approach to directly compute the maximal subspace clusters, it is difficult to identify these 1-D dense units. The reason is that 1-dimensional subspaces may contain interleaved projections of more than one cluster from higher-dimensional spaces; e.g., in Figure 3.2, Cluster1 contains projections from both Cluster5 and Cluster6. The only way to find all possible dense units from Cluster1 is to generate all C(|Cluster1|, τ + 1) combinations of its points.
Let CS be a core-set of such interleaved 1-D dense points such that each point in this core-set is within ε distance of every other, with |CS| > τ. A core-set of the points in a subspace S can be denoted as CS^S. In our algorithm, we first find these core-sets in each dimension and then generate all combinations of size τ + 1 as potential dense units.
As can be seen from Figure 3.2, many such combinations in the d1 dimension will not result in dense units in the subspace {d1, d2}. Moreover, it is possible that none of the combinations will convert into any higher-dimensional dense unit. The construction of 1-dimensional dense units is the most expensive part of our algorithm, as the number of combinations of τ + 1 points can be very high depending on ε, since the value of ε determines the size of the core-sets in each dimension. However, one clear advantage of our approach is that this is the only time we need to scan the dataset in the entire algorithm.
3.2.5
Generation of combinatorial subsets
Assuming a core-set CS of c data points, all dense units of size r can be generated from CS using C(c, r) combinations. There are many algorithms available in the literature to find such combinatorial sequences [64, 65].
We generate the combinatorial subsets of the core-sets in lexicographic order, such that each combination is a sequence l1, l2, . . . , lr with 1 ≤ l1 < l2 < · · · < lr ≤ c. For
example, following are the 10 combinations of size 3 generated from a set {1, 2, 3, 4, 5}:
i: < 1, 2, 3 >
ii: < 1, 2, 4 >
iii: < 1, 2, 5 >
iv: < 1, 3, 4 >
v: < 1, 3, 5 >
vi: < 1, 4, 5 >
vii: < 2, 3, 4 >
viii: < 2, 3, 5 >
ix: < 2, 4, 5 >
x: < 3, 4, 5 >
Using the initial combination sequence <1, 2, 3> as a seed, the next lexicographic sequence can be generated iteratively from its predecessor, as shown in Algorithm 1. We notice in the above combinations that each position in the last sequence <3, 4, 5> has reached its saturation point. A position i in a combinatorial sequence is said to have reached its saturation point if it cannot take any larger value, that is, when it has reached the maximum possible value of c − r + i. Starting from position r of the predecessor, we backtrack towards the first position until we find a position that is still active, that is, one that has not reached its saturation point. The next sequences are generated from an active position to the rth position, as shown in steps 15 to 17 of Algorithm 1 (getDenseUnits). The algorithm stops when all of the r positions have reached their saturation points. The initial seed is set to <0, c − r + 2, c − r + 3, . . . , c>.
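For reference, Python's standard library reproduces the same ten lexicographic sequences directly; the snippet below is a convenience equivalent of Algorithm 1 for illustration, not the thesis implementation.

from itertools import combinations

# Positions 1..5 of a core-set of c = 5 points, choosing r = 3 at a time,
# reproduce the ten sequences i-x listed above, in the same lexicographic order.
for combo in combinations(range(1, 6), 3):
    print(combo)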
The SUBSCALE algorithm is explained in the next subsection.
3.2.6
SUBSCALE algorithm
As discussed before, we aim to extract the subspace clusters by finding the dense units in the relevant subspaces of the given dataset, using the L1 metric as the distance measure. We assume a fixed size of τ + 1 for the dense unit U, which is the smallest possible cluster of dense points. If |CS| is the number of points in a 1-D core-set CS, then we can obtain C(|CS|, τ + 1) dense units from one such CS. If we map the data points contained in each dense unit to large integers and compute their sum, then each such sum will create a unique signature. From observation 3, if two such signatures match, then their corresponding dense units contain the same points with a very high probability. In the SUBSCALE algorithm, we first find these 1-D dense units in each dimension and then hash their signatures into a common hash table (Figure 3.7).
We now explain our algorithm for generating maximal subspace clusters in high-dimensional data.
Input: CS: a core-set of c points; r: the size of each combination to be generated from the core-set.
Output: DenseUnits: a set of C(c, r) dense units.
1   seed and U are empty arrays of size r each.
2   for i ← 2 to r do
3       seed[i] ← c − r + i
4   end
5   seed[1] ← 0    /* This step makes sure that the first lexicographic sequence will be generated from the seed. */
6   while true do
7       i ← r
8       while i > 0 and seed[i] = c − r + i do
9           Decrement i    /* Get the active position. */
10      end
11      if i = 0 then
12          break    /* It signifies that all combinations have been generated. */
13      else
14          temp ← seed[i]    /* Get the seed element. */
15          for j ← i to r do
16              k ← temp + 1 + j − i
17              seed[j] ← k
18              U[j] ← CS[k]
19          end
20          Copy the dense unit U to the output set DenseUnits
21      end
22  end
Algorithm 1: getDenseUnits: Find all combination subsets of size r from a core-set of size c.
Figure 3.7: Signatures from different dimensions collide to identify the relevant subspaces for the corresponding dense units behind these signatures. d_i is the ith dimension and Sig_x^i is the signature of a dense unit U_x^i in the ith dimension.
Step 1: Consider a set K of very large, unique and random positive integers {K1, K2, . . . , Kn}. We define M as a one-to-one mapping function, M : DB → K. Each point Pi ∈ DB is assigned a unique random integer Ki from the set K.
Step 2: In each dimension j, we have the projections of the n points, P_1^j, P_2^j, . . . , P_n^j. We create all possible dense units containing τ + 1 points that are within an ε distance.
Step 3: Next, we create a hash table hTable, as follows. In each dimension j, for every dense unit U_a^j, we generate its signature Sig_a^j. A signature is calculated by mapping the elements of the dense unit U_a^j to their corresponding keys from the key database K and summing them up. The signature thus generated is hashed into the hTable. Using observation 3, if Sig_a^j collides with another signature Sig_b^k, then the dense unit U_a^j exists in the subspace {j, k} with extremely high probability. After repeating this process in all single dimensions, each entry of this hash table will contain a dense unit in a maximal subspace. The colliding dimensions are stored along with each signature Sig_i ∈ hTable.
Step 4: We now have dense units in all possible maximal subspaces. We can use any full-dimensional clustering algorithm on each subspace to process these maximal dense units into maximal subspace clusters. In our research, we use DBSCAN in each found subspace for the clustering process. The ε and τ parameters can be adapted differently as per the dimensionality of the subspace to handle the curse of dimensionality.

P1, P7, P3, P12, P5, P4, P9, P2, P6, . . .
Figure 3.8: An example of sorted data points in a single dimension
The pseudocode is given in Algorithm 2 (SUBSCALE), Algorithm 3 (FindSignatures) and Algorithm 4 (findSum) below. Algorithm 2 (SUBSCALE) takes a database of n × k points as input and finds the maximal subspace clusters by hashing the signatures generated through Algorithm 3 (FindSignatures). The values of τ and ε are user defined. The key database K is randomly generated but can also be supplied as input. The hash table hTable can be indexed on the sum value for direct access and thus faster collision detection.
The core-sets are found in each dimension by sorting the projections of the data points in that dimension. Starting with each point, the neighbours are collected until they start falling out of the ε range. Algorithm 1 is used to generate all combinatorial dense units from the core-sets. Algorithm 4 (findSum) finds the signature sum of each dense unit U, which is collided with the other signature sums to detect the maximal subspace of a dense unit.
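As a rough illustration of this per-dimension pass (a simplified sketch under the same ε and τ definitions, not the Java implementation), the function below sorts the 1-D projections, grows an ε-window core-set from each point, emits every (τ + 1)-sized dense unit from it, and returns the label-sum signature of each unit; the point values and stand-in labels in the usage example are made up.

from itertools import combinations

def find_signatures(column, labels, eps, tau):
    """column: {point_id: 1-D value}; labels: {point_id: large integer label}.
    Returns {signature_sum: frozenset_of_points} for one dimension."""
    ordered = sorted(column, key=column.get)
    signatures = {}
    for i, start in enumerate(ordered):
        core_set = [start]
        for nxt in ordered[i + 1:]:
            if column[nxt] - column[start] >= eps:
                break                                  # window of width eps is exceeded
            core_set.append(nxt)
        if len(core_set) > tau:                        # enough neighbours to be dense
            for unit in combinations(core_set, tau + 1):
                signatures[sum(labels[p] for p in unit)] = frozenset(unit)
    return signatures

column = {"P1": 0.10, "P2": 0.11, "P3": 0.12, "P4": 0.13, "P5": 0.60}
labels = {p: (i + 1) * 10**10 + 7 for i, p in enumerate(column)}   # stand-in labels
print(find_signatures(column, labels, eps=0.05, tau=3))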
3.2.7
Removing redundant computation of dense units
The SUBSCALE algorithm can be optimised further for faster execution by removing the
redundant calculations of dense units. The dense units are calculated using combinatorial
mixing of all points from a core-set.
Let us assume a particular dimension d contains the sorted data points {P1, P7, P3, P12, P5, P4, P9, P2, P6, . . . } in that order, and let τ = 3 (see Figure 3.8).
Input: DB of n × k points.
Output: Clusters: the set of maximal subspace clusters.
Hash table hTable ← {}    /* An entry in hTable is {sum, U, subspace}. */
for j ← 1 to k do
    Signatures ← FindSignatures(DB, j)    /* Get the candidate signatures in dimension j. */
    for each entry Sigx : {sum, U, subspace} ∈ Signatures do
        if there exists another signature Sigy in hTable such that Sigy.sum = Sigx.sum then
            Append dimension j to Sigy.subspace
        else
            Add the new entry Sigx to the hTable
        end
    end
end
for all entries {Sigx, Sigy, . . . } ∈ hTable do
    if Sigx.subspace = Sigy.subspace = . . . then
        Add entry {subspace, Sigx.U ∪ Sigy.U ∪ . . . } to Clusters
        /* ∪ is the set-union operator. Clusters contains the maximal dense units in the relevant subspaces. */
    end
end
Run DBSCAN on each entry of Clusters    /* Clusters is the resulting set of maximal subspace clusters. */
Algorithm 2: SUBSCALE: Find maximal subspace clusters.
Input: DB of n × k points; dimension j; τ; and ε.
Output: Signatures: the set of signatures of the dense units.
1   Sort P1, P2, . . . , Pn such that ∀Px, Py ∈ DB, P_x^j ≤ P_y^j
2   for i ← 1 to n − 1 do
3       CS ← Pi    /* CS is a core-set of dense points. */
4       numNeighbours ← 1
5       next ← i + 1
6       while next ≤ n and P_next^j − P_i^j < ε do
7           Append P_next to CS
8           Increment numNeighbours
9           Increment next
10      end
11      if numNeighbours > τ then
12          DenseUnits ← getDenseUnits(|CS|, τ + 1)
13      end
14      for each dense unit U ∈ DenseUnits do
15          sum ← findSum(U, K)
16          subspace ← j
17          Add entry {sum, U, subspace} to Signatures
            /* Signatures is a data structure that stores the dense units along with their corresponding signatures in a given dimension. */
18      end
19  end
Algorithm 3: FindSignatures: Find signatures of dense units, including their sums.

Input: A dense unit U; a set K of n unique, random and large integers.
Output: sum: the sum of the keys corresponding to each point in the dense unit.
sum ← 0
for each Pi in U do
    sum ← sum + M(Pi)    /* M is a one-to-one mapping function, M : DB → K. */
end
Algorithm 4: findSum: Calculates the sum of the corresponding keys of points in the
dense unit U.
Using step 2 of Algorithm 3 and beginning with point P1, the core-set CS1 = {P1, . . . , P9}; then, with the next point P7 in this sorted list, the core-set CS2 comes out as {P7, . . . , P9}, based on a given ε value. The dense units are generated as the combinations of τ + 1 points from the core-sets. We notice that all of the points in CS2 have already been covered by CS1; therefore, CS2 will not generate any new signatures beyond those generated by the core-set CS1.
The reason behind this redundancy is that both CS1 and CS2 share the same lastElement, which is the data point P9. We can eliminate these computations by keeping a record of the lastElement of the previous core-set CSi. If the lastElement of the core-set CSi+1 is the same as that of the previous core-set, then we can safely drop the core-set CSi+1. In this case, the core-set CS1 generates C(7, 4) = 35 dense units, out of which C(6, 4) = 15 dense units will be generated again by CS2 if we do not eliminate it.
Another cause of redundant dense units is the overlapping of points between consecutive core-sets. For example, as we can see in Figure 3.8, the core-set CS3 starting with point P3 will contain the points {P3, . . . , P6}, and we cannot discard this core-set as its lastElement is not the same as that of core-set CS2. The intersecting set of points between the core-sets CS2 and CS3 consists of 5 points: {P3, P12, P5, P4, P9}. Thus, C(5, 4) = 5 combinations of CS3 would already have been generated by CS2.
To eliminate the redundant dense unit computations due to overlapping data points, we can use a special marker in each core-set called the pivot, which is the position of the lastElement of the previous core-set. For example, in core-set CS1 the pivot can be set to −1, which means that we need to compute all of the combinations from this set, as there is no previous lastElement and none of the combinations from this core-set has been computed before (Figure 3.9). There is one more scenario in which we need to re-compute all the combinations of the core-set even when the lastElement of the previous core-set exists in the current core-set, and that happens when 0 ≤ pivot ≤ τ. Therefore, when pivot > τ in the current core-set, we need not compute the (τ + 1)-combinations drawn entirely from the points lying between index 1 and the pivot in the core-set.
[Figure: the sorted points P1, P7, P3, P12, P5, P4, P9, P2, P6, . . . with core-set CS1 (pivot = −1), core-set CS2 (discarded) and core-set CS3 (pivot = 5) marked.]
Figure 3.9: An example of overlapping between consecutive core-sets of dense data points.
[Figure: core-set CS3 over the sorted points P1, P7, P3, P12, P5, P4, P9, P2, P6, split at pivot = 5 into the partitions CS31 = {P3, P12, P5, P4, P9} and CS32 = {P2, P6}, whose partial combinations are combined into dense units.]
Figure 3.10: An example of using the pivot to remove redundant computations of dense units from the core-sets.
For core-set CS3 = {P3, P12, P5, P4, P9, P2, P6}, the points between index 1 and 5 have already been computed for dense units by CS2. Instead of computing C(7, 4) = 35 combinations, we can create partial combinations from two partitions of the core-set, as shown in Figure 3.10.
As in this case of core-set CS3, using the pivot-based computation of dense units will result in only those dense unit combinations which have not been generated before in the previous core-sets. When the size of the core-sets gets bigger, this approach results in considerable savings in computation time. The improved SUBSCALE algorithm is given below in Algorithm 5 (findOptimalSignatures) and Algorithm 6 (getDenseUnitsPivot). In these algorithms, we use the notation |Set| to denote the number of elements in the Set.
Input: DB of n × k points; dimension j; τ; and ε.
Output: Signatures: the set of signatures of the dense units.
Sort P1, P2, . . . , Pn such that ∀Px, Py ∈ DB, P_x^j ≤ P_y^j
last ← −1
pivot ← −1
for i ← 1 to n − 1 do
    CS ← Pi    /* CS is a core-set of dense points. */
    numNeighbours ← 1
    next ← i + 1
    while next ≤ n and P_next^j − P_i^j < ε do
        Append P_next to CS
        Increment numNeighbours
        Increment next
    end
    newLast ← CS.lastElement
    if newLast ≠ last then
        pivot ← CS.indexOf(last)
        last ← newLast
        if numNeighbours > τ then
            if pivot ≤ τ then
                DenseUnits ← getDenseUnits(|CS|, τ + 1)    /* |CS| is the total number of points in CS. */
            else
                DenseUnits ← getDenseUnitsPivot(CS, pivot)
            end
        end
    end
    for each dense unit U found in the previous step do
        subspace ← j
        sum ← findSum(U, K)
        Add entry {sum, U, subspace} to Signatures
        /* Signatures is a data structure that stores the dense units along with their corresponding sums in a given dimension. */
    end
end
Algorithm 5: findOptimalSignatures: Find optimal signatures of dense units, including their sums.
Input: a core-set CS; pivot.
Output: DenseUnits: a set of dense units, each of size τ + 1.
Split CS into CS1 and CS2 such that CS1 contains the first 1 . . . pivot points and CS2 contains the rest of the points
if |CS2| > τ then
    DenseUnits ← getDenseUnits(|CS2|, τ + 1)
    select ← τ
else
    select ← |CS2|
end
count ← 1
do
    Combine getDenseUnits(|CS1|, τ + 1 − count) and getDenseUnits(|CS2|, count), which are partial dense
    units, to generate a total of C(|CS1|, τ + 1 − count) × C(|CS2|, count) dense units and add them to the
    set DenseUnits
    count ← count + 1
while count ≤ select
Algorithm 6: getDenseUnitsPivot: Find dense units using the pivot.
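A condensed sketch of this optimisation is shown below (illustrative only): a core-set is assumed to have been kept only if its lastElement changed, and the pivot is then used to generate just those combinations that take at least one point beyond the pivot, by pairing partial combinations from the two partitions.

from itertools import combinations

def dense_units_with_pivot(core_set, pivot, tau):
    """Generate only the (tau + 1)-sized units not already produced by the
    previous core-set, where `pivot` is the index of that core-set's last element."""
    if pivot <= tau:                      # too little overlap: generate everything
        yield from combinations(core_set, tau + 1)
        return
    cs1, cs2 = core_set[:pivot], core_set[pivot:]
    # Every new unit must take at least one point from cs2 (the unseen tail).
    for take_from_cs2 in range(1, min(tau + 1, len(cs2)) + 1):
        for right in combinations(cs2, take_from_cs2):
            for left in combinations(cs1, tau + 1 - take_from_cs2):
                yield left + right

# Worked example from Figure 3.10: CS3 with pivot = 5 and tau = 3.
cs3 = ["P3", "P12", "P5", "P4", "P9", "P2", "P6"]
units = list(dense_units_with_pivot(cs3, pivot=5, tau=3))
print(len(units))   # 30 new units instead of all C(7, 4) = 35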
In the next section, we evaluate and discuss the performance of our proposed subspace
clustering algorithm.
3.3
Results and discussion
We experimented with various datasets of up to 500 dimensions (Table 3.3). Also, we compared the results from our algorithm with other relevant state-of-the-art algorithms. We fixed the value of τ = 3 unless stated otherwise and experimented with different values of ε for each dataset, starting with the lowest possible value. The minimum cluster size (minSize) was set to 4.
3.3.1
Methods
We implemented the SUBSCALE algorithm in the Java language on an Intel Core i7-2600 desktop with a 64-bit Windows 7 operating system and 16 GB RAM. The dense points in the maximal subspaces were found through the SUBSCALE algorithm.
Table 3.3: List of datasets used for evaluation
Data      Size   Dimensionality
D05       1595   5
D10       1595   10
D15       1598   15
D20       1595   20
D25       1595   25
D50       1596   50
S1500     1595   20
S2500     2658   20
S3500     3722   20
S4500     4785   20
S5500     5848   20
madelon   4400   500
Then, for each found subspace containing a dense set of points, we used a Python script to apply the DBSCAN algorithm from the scikit-learn library [66]. However, any full-dimensional density-based algorithm can be used instead of DBSCAN. The open-source framework by Müller et al. [67] was used to assess the quality of our results and also to compare these results with other clustering algorithms available through the same platform.
The datasets in Table 3.3 were normalised between 0 and 1 using WEKA [68] and contained no missing values. The datasets are freely available at the website of the authors of the related work [67]. The 4400 × 500 madelon dataset is available at the UCI repository [69]. The source code of the SUBSCALE algorithm can be downloaded from the Git repository [70].
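For reference, the post-processing step can be sketched as follows with scikit-learn's DBSCAN; the dataset file name, the `found` mapping of subspaces to candidate points, and the parameter values in the commented usage are hypothetical placeholders for the SUBSCALE output.

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_subspace(data, dims, point_ids, eps, tau):
    """Run DBSCAN on the projection of the given points onto one found subspace."""
    projection = data[np.ix_(point_ids, dims)]
    # DBSCAN's min_samples counts the point itself, hence tau + 1.
    model = DBSCAN(eps=eps, min_samples=tau + 1).fit(projection)
    return model.labels_          # -1 marks noise; other labels are cluster ids

# Hypothetical usage: `found` maps a subspace (tuple of dims) to candidate point ids.
# data = np.loadtxt("D20_normalised.csv", delimiter=",")
# for dims, point_ids in found.items():
#     labels = cluster_subspace(data, list(dims), point_ids, eps=0.001, tau=3)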
3.3.2
Execution time and quality
The effect of changing ε values on the runtime is shown in Figure 3.11 for two datasets of dimensionality 5 and 50. Larger ε values result in bigger neighbourhoods and generate a larger number of combinations and, therefore, more execution time.
[Figure: runtime (ms) versus ε (0–0.005) for the datasets D05 and D50.]
Figure 3.11: Effect of ε on the runtime for two different datasets of 5 and 50 dimensions. The clustering time increases with the ε-value due to the bigger neighbourhood: more combinations need to be generated for a bigger core-set.
One of the most commonly used methods to assess the quality of a clustering result in the related literature is the F1 measure [71]. The open-source framework by Muller et al. [67] was used to assess the F1 quality measure of our results. According to the F1 measure, the clustering algorithm should cover as many points as possible from the hidden clusters and as few as possible of those points which are not in the hidden clusters. F1 is computed as the harmonic mean of recall and precision, F1 = 2 · precision · recall / (precision + recall). The recall value accounts for the coverage of the points in the hidden clusters by the found clusters. The precision value measures the coverage of points in the found clusters from other clusters. High recall and precision values mean a high F1 and thus better quality [71].
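A minimal sketch of this measure for a single found/hidden cluster pair is given below (the evaluation framework in [67] aggregates such scores over all hidden clusters; the sets used here are made up).

def f1_score(found: set, hidden: set) -> float:
    """Harmonic mean of precision and recall between a found and a hidden cluster."""
    if not found or not hidden:
        return 0.0
    overlap = len(found & hidden)
    if overlap == 0:
        return 0.0
    precision = overlap / len(found)    # how much of the found cluster is genuine
    recall = overlap / len(hidden)      # how much of the hidden cluster is covered
    return 2 * precision * recall / (precision + recall)

print(f1_score({"P1", "P2", "P3", "P4"}, {"P2", "P3", "P4", "P5", "P6"}))  # 0.666...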
Figure 3.12 shows the F1 values for different datasets and with different ε settings. We notice that the quality of the clusters deteriorates beyond a certain threshold of ε because clusters can get artificially large due to the larger ε value. Thus, a larger value of ε is not always necessary to get better quality clusters.
[Figure: ε (0.001–0.0022) versus F1 measure (0.4–1) for the datasets S2500, S3500, S4500, S5500, D20 and D25.]
Figure 3.12: ε-value versus F1 measure for six different datasets. A change in the ε-value has a similar impact on the cluster quality (F1 measure) for each dataset; the cluster quality degrades for bigger ε-values.
We evaluated the SUBSCALE algorithm against the INSCY algorithm, which is a recent state-of-the-art subspace clustering algorithm. The INSCY algorithm is available through the same framework used for the F1 evaluation [67]. We used the same parameter values (ε, τ and minSize) for both the INSCY and SUBSCALE algorithms. As shown in Figures 3.13 and 3.14, the SUBSCALE algorithm gave much better runtime performance for a similar quality of clusters (F1 = 0.9), particularly for higher-dimensional datasets.
Figure 3.15 shows the effect of increasing data dimensionality on the runtime for the SUBSCALE algorithm versus other subspace clustering algorithms (INSCY, SUBCLU and CLIQUE, available on the open-source framework [67]), keeping the data size fixed (data used: D05, D10, D15, D20, D25, D50, D75). The runtime axis of Figure 3.15 is plotted on a logarithmic scale. We notice that the CLIQUE algorithm shows the worst performance in terms of scalability with respect to the dimensionality of the dataset. The parameters τ = 0.003 and ξ = 3 were used for CLIQUE, while similar values of ε = 0.001 and τ = 3 were used for the SUBSCALE, INSCY and SUBCLU algorithms.
[Figure: runtime (ms) of SUBSCALE and INSCY on the datasets S1500, S2500, S3500, D20, D25 and D50; clustering results with F1 = 0.9.]
Figure 3.13: Runtime comparison between the SUBSCALE and INSCY algorithms for a similar quality of clusters (F1 measure 0.9) on six datasets of different sizes and numbers of dimensions. The SUBSCALE algorithm gave better performance than the INSCY algorithm.
[Figure: runtime (ms) versus F1 for SUBSCALE and INSCY on dataset D25 (1595 × 25).]
Figure 3.14: Runtime comparison between the SUBSCALE and INSCY algorithms for the same dataset but with different quality of output clusters. The cluster quality can be changed by changing the ε-value used for finding the subspace clusters. The SUBSCALE algorithm gave better performance than the INSCY algorithm.
[Figure: runtime (ms, log scale) versus dimensionality (5–25) for SUBSCALE, INSCY, SUBCLU and CLIQUE.]
Figure 3.15: Runtime comparison between different subspace clustering algorithms for a fixed data size of 1595 points, with the dimensionality varying from 5 to 25. As mentioned in the discussion, the SUBSCALE algorithm gave the best performance.
However, the SUBCLU algorithm did not give very meaningful clusters, as even single points were shown as clusters in the result, and the CLIQUE algorithm crashed for ≥ 15 dimensions. We observe that the SUBSCALE algorithm clearly performs better than the rest of the algorithms as the number of dimensions increases. Figure 3.16 shows the runtime comparison of our algorithm with INSCY and SUBCLU for a fixed number of dimensions and varying the size of the data from 1500 points to 5500 points (datasets used: S1500, S2500, S3500, S4500, S5500).
The 4400 × 500 madelon dataset has ∼ 2^500 possible subspaces. We ran the SUBSCALE algorithm on this dataset to find all possible subspace clusters, and Figure 3.17 shows the runtime performance with respect to the ε values. Different values of ε ranging from 1.0 × 10^−5 to 1.0 × 10^−6 were used. We tried to run INSCY by allocating up to 12 GB RAM, but it failed to run for this high-dimensional dataset for any of the ε values.
[Figure: runtime (ms) versus dataset size (1000–6000 points) for SUBSCALE, INSCY and SUBCLU.]
Figure 3.16: Runtime comparison between different subspace clustering algorithms for a fixed dimensionality of 20, with the data size varying from 1500 to 5500.
[Figure: runtime (ms) versus the number of subspaces found for the madelon dataset; annotations mark results with 345, 4897, 41391 and 138635 clusters.]
Figure 3.17: Number of subspaces/clusters found versus runtime (madelon dataset), with τ = 3 and minSize = 4. Different ε values ranging from 1.0E−5 to 1.0E−6 were used. The number of clusters, as well as the number of subspaces in which these clusters are found, increases with the increase in the ε value.
3.3.3
Determining the input parameters
An unsupervised clustering algorithm has no prior knowledge of the density distribution of the underlying data. Even though the choice of the density measures (ε, τ and minSize) is very important for the quality of the subspace clusters, finding their optimal values is a challenging task.
The SUBSCALE algorithm initially requires both the τ and ε parameters to find the 1-D dense points. Once we identify the dense points in the maximal subspaces through the SUBSCALE algorithm, we can then run the DBSCAN algorithm on the identified subspaces by setting the τ, ε and minSize parameters according to each subspace. We should mention that finding clusters in these subspaces by running the DBSCAN algorithm, or any other density-based clustering algorithm, takes relatively little time. During our experiments, the average time taken by the DBSCAN algorithm comprised less than 5% of the total execution time for the evaluation datasets given in Table 3.3. The reason is that each of the maximal subspaces identified by the SUBSCALE algorithm has already been pruned to only those points which have a very high probability of forming clusters in these subspaces.
In our experiments, we started with the smallest possible ε-distance between any two 1-D projections of points in the given dataset. Considering the curse of dimensionality, the points become farther apart in high-dimensional subspaces, so the user may intend to find clusters with larger ε values than those used by the SUBSCALE algorithm for the 1-D subspaces. Most of the subspace clustering algorithms use heuristics to adjust these parameters in higher-dimensional subspaces. Some authors [72, 73] have suggested methods to adapt the ε and τ parameters for high-dimensional subspaces. However, we would argue that the choice of these density parameters depends strongly on the individual dataset as well as the user requirements.
3.4 Summary
In this chapter, we have presented a novel approach to efficiently find the quality subspace
clusters without expensive database scans or generating trivial clusters in between. We
have validated our idea both theoretically as well as through numerical experiments.
Using the SUBSCALE algorithm, we have experimented with 5 to 500 dimensional
datasets and analysed in detail its performance as well as the factors influencing the quality of the clusters. We have also discussed various issues governing the optimal values of
the input parameters and the flexibility available in the SUBSCALE algorithm to adapt
these parameters accordingly.
Since our algorithm directly generates the maximal dense units, it is possible to implement a query-driven version of our algorithm relatively easily. Such an algorithm will
take a set of (query) dimensions and find the clusters in the subspace determined by this
set of dimensions.
However, the main cost in the SUBSCALE algorithm is the computation of the candidate one-dimensional dense units. All of these dense units need to be stored in a hash table in working memory first so that the collisions can be identified. So, the efficiency of the SUBSCALE algorithm seems to be limited by the availability of working memory.
But, this algorithm has a high degree of parallelism as there is no dependency in computing these dense units across different dimensions. The computations can be split and
processed as per the availability of working memory. We discuss these scalability issues
and possible solutions in the next chapter.
Chapter 4
Scalable subspace clustering
4.1 Background
Datasets have been growing exponentially in recent years across various domains such as healthcare, sensors, the Internet and financial transactions [74, 75]. A recent IDC report has predicted that the total data being produced in the world will grow up to 44 trillion gigabytes by 2020 [76]. In fact, accumulated data has been following Moore's law, that is, doubling in size every two years. This explosion has been both in the size and the dimensions of the data; for example, sophisticated sensors these days can measure an increasingly large number of variables like location, pressure, temperature, vibration and humidity.
Therefore, a subspace clustering algorithm should be scalable with both size and dimensions of the data.
The SUBSCALE algorithm presented in the previous chapter needs to be extended to provide the required scalability. As the data grows in size and/or dimensions, the number of hidden clusters is also expected to grow. However, the number of clusters will depend upon the underlying data distribution and the density parameter settings. Sometimes a smaller dataset can have a larger number of clusters than a bigger dataset. As shown in Figure 4.1, only one cluster exists in Data Set 1, which is seemingly bigger in
size than Data Set 2, which has three clusters hidden inside it. The reason is that Data Set 1 is much more uniformly distributed than Data Set 2.
Figure 4.1: The number of clusters from a bigger data set (left) can be less than the number of clusters hidden in the smaller data set (right).
On the other hand, an increase in dimensions has a different impact on the number of clusters than an increase in the size of the dataset. As discussed in chapter 1, due to the curse of dimensionality, data appears to be sparse in higher dimensions. The lack of contrast among data points in higher-dimensional subspaces results in a decrease in the number of clusters, as the data points appear to be equidistant from each other. Figure 4.2 shows an example of the increase in sparsity among data points as we move from a 1-dimensional to a 2- and 3-dimensional data space. The three groups of points represented by red, green, and blue which were closer in the 1-dimensional space become farther apart in the 3-dimensional space. This is the reason why most of the algorithms which use all dimensions of the data to measure the proximity among points report fewer clusters for high-dimensional data.
One more implication of this curse of dimensionality is that in high-dimensional data, clusters are often hidden in lower-dimensional projections. As the number of dimensions increases, the number of lower-dimensional projections (subspaces) also grows. Table 4.1 shows the increase in the number of low-dimensional subspaces as the total number of dimensions increases from 10 to 10000. We have discussed some of the clustering algorithms in
Figure 4.2: Data sparsity with increase in the number of dimensions. There are fewer clusters in a high-dimensional dataset. The clusters lie in lower-dimensional subspaces of the data.
detail in chapter 3 and have also highlighted the reasons why these clustering algorithms struggle to find all hidden clusters in high-dimensional data.
The top-down projected clustering algorithms cannot handle high-dimensional data as these are essentially space-partitioning algorithms and, moreover, the user has to define the number of clusters and the relevant subspaces. As clustering is an unsupervised data mining process, we do not have prior information about the underlying data density. Without
exploring all possible subspaces it is not possible to find all hidden clusters, even though
the number of clusters may turn out to be really small at the end.

Table 4.1: Number of subspaces with increase in dimensions

Size of the     Total number of dimensions
subspace        10        100          1000              10000
2               45        4950         499500            49995000
3               120       161700       166167000         166616670000
4               210       3921225      41417124750       416416712497500
5               252       75287520     8250291250200     832500291625002000

Most of the subspace
clustering algorithms explode as the number of dimensions increases due to the exponential search space.
The SUBSCALE algorithm becomes particularly important when it comes to finding subspace clusters in high-dimensional data. The SUBSCALE algorithm computes the maximal subspace clusters directly from the 1-dimensional dense points, without the need to explore each and every lower-dimensional subspace. Each cluster is detected in its relevant maximal subspace and all possible non-trivial subspace clusters are detected.
The main issue with the SUBSCALE algorithm proposed in the previous chapter is that it must store all of the 1-dimensional dense points in the working memory of the system. So, the size of the working memory becomes a constraint on the scalability of this algorithm. Even though the cost of RAM (Random Access Memory) is coming down with each passing year, the memory requirement can become extremely large for bigger data sets. Ideally, a subspace clustering algorithm should be able to crunch bigger datasets within the available memory.
In this chapter, we aim to modify the SUBSCALE algorithm so that it can handle bigger datasets with limited working memory. In the next section, we discuss in detail the memory bottleneck caused during SUBSCALE computations. In section 4.3, we look at solutions to avoid the large memory requirement and propose a scalable algorithm in section 4.4. The experimental results with the scalable algorithm are analysed in section
4.5.
4.2 Memory bottleneck
The main computation of the SUBSCALE algorithm is based on generation of the dense
units across single dimensions. Even if these dense units are not combined iteratively
in a step-by-step bottom-up manner as in the other subspace clustering algorithms, the
Figure 4.3: The information about a signature generated from a dense unit is stored in a
Sig data structure. The information contains sum value, points in the dense unit and a set
of dimensions in which this signature exists.
signatures of the dense units still need to be matched with each other. A common hash
table is required to match the dense units from different dimensions.
Recalling the SUBSCALE algorithm from chapter 3, each of the n k-dimensional data points is mapped to a unique key from a pool of n large-integer keys. The dense units containing τ + 1 points are computed in each dimension and their signatures are calculated. A signature of a dense unit is the sum of the corresponding keys of the points contained in the dense unit. The value of a signature is thus a large integer. If the signatures of two dense units from different dimensions collide, that is, both have the same value, then both dense units have exactly the same points in them with very high probability.
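As a concrete (and deliberately minimal) illustration of this signature scheme, the C sketch below assigns random 14-digit keys to point ids and sums them for a dense unit; it is not the thesis implementation (whose prototype is in Java), and the helper names and toy values are our own assumptions.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TAU 3   /* density threshold: a dense unit has TAU + 1 points */

/* Assign a random 14-digit integer key to every point id (illustrative only;
 * the quality of rand() is unimportant for the sketch). */
static void assign_keys(uint64_t *keys, int n)
{
    for (int i = 0; i < n; i++) {
        uint64_t r = ((uint64_t)rand() << 31) ^ (uint64_t)rand();
        keys[i] = 10000000000000ULL + r % 90000000000000ULL;   /* in [10^13, 10^14) */
    }
}

/* Signature of a dense unit = sum of the keys of the points it contains. */
static uint64_t signature(const int *unit, int size, const uint64_t *keys)
{
    uint64_t sum = 0;
    for (int i = 0; i < size; i++)
        sum += keys[unit[i]];
    return sum;
}

int main(void)
{
    enum { N = 6 };
    uint64_t keys[N];
    assign_keys(keys, N);

    /* The same four point ids, seen as a dense unit in two different
     * dimensions, yield the same signature and therefore collide. */
    int unit_dim_i[TAU + 1] = { 0, 2, 3, 5 };
    int unit_dim_p[TAU + 1] = { 0, 2, 3, 5 };

    printf("sig in dim i = %llu, sig in dim p = %llu\n",
           (unsigned long long)signature(unit_dim_i, TAU + 1, keys),
           (unsigned long long)signature(unit_dim_p, TAU + 1, keys));
    return 0;
}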
The information about a dense unit along with its signature and the subspace in which
it exists, is kept in a signature node, called Sig (Figure 4.3). Initially, the subspace information in each Sig contains the dimension in which the corresponding dense unit is
created. When two dense units from two different dimensions collide, their subspace
fields are merged.
As proposed in the previous chapter, a common hash table is used to collide the signatures from the different single dimensions. After all of the single dimensions have finished their collisions, the maximal sets of dense units are collected from the hash table. Each of the k dimensions can generate a different number of signatures. When two signatures collide with each other in the hash table, the complete details of the second signature need not be stored in the hash table, as its sum value is identical to that of the first one and only the subspace of the second signature is recorded. Therefore, the total capacity of the hash table need be no more than the total number of signatures in all dimensions.
Considering the hash table in Figure 4.4, let p, q, . . . , r be the total numbers of signatures generated in dimensions 1 to k respectively, and let collisionConstant be the number of signatures which collided with signatures already in the hash table and thus do not need extra space in the hash table. The hash table should have enough capacity to store (p + q + · · · + r) − collisionConstant signature nodes. As said earlier, we do not have prior information about the clusters and hence we do not know which dense units or signatures will collide. Therefore, the value of collisionConstant is not known beforehand. Even if collisionConstant could somehow be known in advance, the resulting memory requirement can be enormous for high-dimensional data sets.
Figure 4.4: Signatures from different dimensions collide in a common hash table.
Depending upon the underlying data distribution, an excessive number of dense units can be generated from a given dataset. The hash table needs to have enough capacity to store
the signatures generated from these dense units.
The underlying premise of the SUBSCALE algorithm is that if a dense unit U_x with a signature Sig_x^i exists in the i-th dimension and it also exists in the p-th and q-th dimensions, then this unit will have the same signature in all of the three individual dimensions. To check for the maximal subspace {i, p, q} for the dense unit U_x, we need to check for the collisions of the signatures Sig_x^i, Sig_x^p and Sig_x^q using a common hash table.
The madelon dataset in the UCI repository [69] has 4400 data points in 500 dimensions. Using the parameters ε = 0.000001 and τ = 3, we calculated core-sets in each dimension and found that a total of 29350693 dense units are expected according to Algorithm 5 given in the previous chapter. Each dense unit will have 4 data points because τ = 3. If we use random 14-digit integers as the keys, each sum generated from a dense unit can be up to a 15-digit integer and would require a 64-bit integer type (e.g., long in Java or unsigned long long int in C).
Referring to a signature node Sig_x corresponding to a dense unit U_x, fixed space is required to store the sum and the τ + 1 dense points. But we cannot determine the space requirement for the subspace of U_x before all of the collisions from all single dimensions have taken place. Let us assume that no two dense units from different dimensions collided with each other and all of the 29350693 signatures need to be stored. In that case, the subspace field in Sig_x will contain only the single dimension in which Sig_x was generated. We assume that the data point ids and dimensions can be represented by the int data type. The total space required for each signature node will be: sizeof(unsigned long long) + (τ + 1) × sizeof(int) + sizeof(int). On a typical machine, a signature node takes at least 144 bytes of memory space. Therefore, the total space requirement for a hash table to store 29350693 entries is approximately 4 GB.
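The arithmetic above can be sketched in C. The struct below only illustrates the fields named in the text; the raw struct is far smaller than 144 bytes, and we assume the thesis's per-entry budget also covers the object, pointer and hash-table overheads of the actual implementation.

#include <stdio.h>
#include <stdint.h>

#define TAU 3

/* Illustrative layout of the fields a signature node must hold. */
struct sig_node {
    unsigned long long sum;     /* signature: sum of the keys of the dense unit */
    int points[TAU + 1];        /* the tau + 1 point ids forming the dense unit */
    int dim;                    /* the single dimension it was generated in     */
};

int main(void)
{
    long entries = 29350693L;   /* expected dense units for the madelon dataset */
    printf("raw struct size: %zu bytes\n", sizeof(struct sig_node));
    printf("budget at 144 bytes per entry: %.2f GB\n",
           entries * 144.0 / (1024.0 * 1024.0 * 1024.0));
    return 0;
}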
As the size of the data grows, this memory requirement for the hash table will increase substantially. Therefore, the SUBSCALE algorithm needs to be reworked to provide scalability irrespective of the main-memory constraint, which is still a bottleneck in its performance.
In the next section we examine the hash table and the computations involved in it so
as to improve the SUBSCALE algorithm for bigger data sets.
4.3 Collisions and the hash table
Let us revisit Algorithm 2 from chapter 3. The SUBSCALE algorithm proceeds sequentially with respect to the dimensions (Step 2 of Algorithm 2). A dimension (i + 1) will be processed after dimension i has finished the collisions of its signatures in the hash table hTable (Step 8). If the hash table capacity has already reached its maximum by the time the (i + 1)-th dimension is processed, then hTable is likely to either crash or give a memory overflow error. If there was a maximal dense unit in a subspace {i, i + 1, i + 2}, it would not be detected. To check for the maximal subspace in which a dense unit exists, we need to check for its collision by processing the other dense units in all single dimensions.
As discussed before, the total number of expected dense units can be pre-calculated
from the core-sets in all single dimensions. One approach to tackle the limited memory
could be to divide the dimensions into t non-overlapping sets: {dimSet1 , dimSet2 , . . . ,
dimSett } where each dimSet is a collection of dense units from one or more single
dimensions. No dimension participates partially in a dimSet. These dimension sets can
be processed independently either individually or together as a combination of more than
one dimSet. The choice will depend upon the availability of the working memory to find
partial maximal subspace dense units in that set. After finishing the collisions from a
dimSet, the partial maximal dense units can be swapped back to the secondary storage.
After processing all such dimension sets or combinations of dimension sets, these partial maximal dense units can be combined (using signature collisions again) to get maximal
subspace dense units. For example, if Ux = {P1 , P2 , P3 , P4 } is a partial maximal dense
unit in dimSet : {2, 3, 4} and if it also exists in dimSet : {5, 6, 7} then Ux exists in the
union of these sets, that is, dimSet : {2, 3, 4, 5, 6, 7}.
There are two problems with the above approach. Firstly, the density distribution of
the projections of data points in each dimension is different. For bigger datasets, the
number of combinatorial dense units from a single dimension can surpass the available
memory. To find even partially maximal dense units, we need to process more than 1
dimension from a dimension set. Secondly, the partial maximal dense units need to be
combined with all of the other dense units using collisions to find the complete set of
maximal dense units. Both of these arguments make this approach highly inefficient.
An alternative way is to let dense units from all single dimensions collide in the hash
table but with a control over the number of dense units being generated and stored. We
know that each dense unit is identified with its signature, which is an integer value. If
a dense unit matches with another dense unit, the signatures of both will come from the
same range of integer values. We can split the combinatorial dense unit computation into
slots where each slot has an integer range for allowed signature values. The size of a slot
can be adapted according to the available memory and thus, hashing of signatures can fit
in the given hash table. We discuss this approach in detail in the following subsection.
4.3.1 Splitting hash computations
The performance of our algorithm depends heavily on the number of dense units being generated in each dimension. The number of dense units in each dimension is derived from the size of the data, the chosen value of ε and the underlying data distribution. A larger ε increases the range of the neighbourhood of a point and is likely to produce bigger core-sets. As we do not have prior information about the underlying data distribution, a single dimension can have a large number of dense units. Thus, for a scalable solution to subspace clustering through the SUBSCALE algorithm, the system must be able to handle a large number of collisions of the dense units.
In order to identify collisions among dense units across multiple dimensions, we need a collision space (hTable) big enough to hold these dense units in the working memory of the system. But with limited memory availability, this is not always possible. If num^j is the number of total dense units in a dimension j, then a k-dimensional dataset may have NUM = Σ_{j=1}^{k} num^j dense units in total. The signature technique used in the SUBSCALE algorithm has a huge benefit that we can split NUM to the granularity of k.
As each dense unit contains τ + 1 points and we are assigning large integers to these points from the key database K to generate signatures H, the value of any signature thus generated would lie approximately within the range R = [(τ + 1) × min(K), (τ + 1) × max(K)], where min(K) and max(K) are respectively the smallest and the largest keys being used.
Figure 4.5: Illustration of splitting hT able computations. For τ = 3, Split factor sp = 3,
minimum large-key value min(K) = 1088 and maximum large-key value max(K) =
9912, approximate range of expected signature sums is (1088 × (τ + 1), 9912 × (τ + 1)).
Each signature sum is derived from a dense unit of fixed size = τ + 1.
The detection of maximal dense units involves matching of the same signature across
multiple dimensions using a hash table. Thus, the hash table should be able to retain
all signatures from a range R. We can split this range into multiple chunks so that each
chunk can be processed independently using a much smaller hash table. If sp is the split
factor, we can divide the range R into sp parts and thus, into sp hash tables where each
hTable holds at most R/sp entries for these dense units. But since the keys are generated randomly from a fixed range of digits, the actual number of entries will be far fewer. In a 14-digit integer space, we have 9 × 10^13 keys to choose from (9 ways to choose the most significant digit and 10 ways each for the rest of the 13 digits). The number of actual keys being used will be equal to the number of points in the dataset, |DB|.
In Figure 4.5, we illustrate the splitting of hT able computations with a range of 4-digit
integers from 1000 to 9999. Let |DB| = 500 and so we need 500 random keys from 9000
available integer keys. If τ = 3 then, some of these 500 keys will form dense core-sets.
Let us assume that 1/5th of these keys are in a core-set in some 1-dimensional space; then we would need a hash table big enough to store ≈ 4 million entries, which is C(100, 4). If we choose a split factor of 3, then we have 3 hash tables where each hash table can store approximately 1 million entries. Typically, Java uses 8 bytes for a long key, so 32 bytes for a signature with τ = 3, and additional bytes to store the colliding dimensions (say ≈ 40 bytes per entry for an average subspace dimensionality of 10), are required.
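As a concrete illustration of the slicing arithmetic, the following hedged C sketch computes the [LOW, HIGH) bounds of each slice; min(K) and max(K) below are simply the end points of an assumed 14-digit key space, not values from the thesis experiments.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t minK = 10000000000000ULL;   /* smallest 14-digit key (assumed) */
    const uint64_t maxK = 99999999999999ULL;   /* largest 14-digit key (assumed)  */
    const int tau = 3, sp = 3;

    uint64_t R     = (maxK - minK) * (uint64_t)(tau + 1);
    uint64_t slice = R / sp;
    uint64_t low   = minK * (tau + 1);

    for (int split = 0; split < sp; split++) {
        uint64_t high = low + slice;
        /* only signature sums s with low <= s < high are hashed in this pass */
        printf("slice %d: [%llu, %llu)\n", split,
               (unsigned long long)low, (unsigned long long)high);
        low = high;
    }
    return 0;
}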
In the next section, we explain the scalable version of the SUBSCALE algorithm
followed by its experimental evaluation and analysis of the results.
4.4
Scalable SUBSCALE algorithm
The SUBSCALE algorithm from Chapter 3 was redesigned to accommodate scalability
with bigger datasets. The pseudo code for the scalable SUBSCALE algorithm is given in
Algorithm 7 (scalableSUBSCALE) below. Instead of finding core-sets in each dimension
and simultaneously hashing them into the hash table hTable as in the previous chapter, the
core-sets in all single dimensions are precomputed (step 1 of Algorithm 7) using Algorithm 8 (findCoreSets).
The split factor sp is supplied by the user along with the ε and τ values. Each of the sp slices generates candidate signatures using Algorithm 9 (findSignaturesInRange), between the LOW and HIGH values computed in steps 2-4 of Algorithm 7. Each entry Sig_x of hTable is a signature node {sum, U, subspace} corresponding to a dense unit U. It is expected that a large number of subspaces containing maximal dense units will be detected for bigger datasets. To avoid memory overflow, we store these maximal dense units in the relevant file storage (steps 19-21 of Algorithm 7). The relevant file is named after the size of the subspace in which a maximal dense unit is detected. Thus, all
2-dimensional maximal dense units will be stored in a file named ‘2.dat’, 3-dimensional
maximal dense units will be stored in a file named ‘3.dat’ and so on. These files can be
processed later using a scripting language to run DBSCAN or similar cluster generation
algorithm on the already detected dense points in these files.
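A possible sketch of this file layout in C is given below; the function name, the one-record-per-line format and the toy values are our own illustrative choices, not the thesis's exact output format.

#include <stdio.h>

/* Append a maximal dense unit found in a subspace of cardinality d to "d.dat". */
static void append_dense_unit(const int *dims, int d, const int *points, int npts)
{
    char fname[32];
    snprintf(fname, sizeof fname, "%d.dat", d);   /* e.g. "2.dat", "3.dat", ... */

    FILE *fp = fopen(fname, "a");
    if (!fp) return;

    fprintf(fp, "subspace:");
    for (int i = 0; i < d; i++) fprintf(fp, " %d", dims[i]);
    fprintf(fp, " points:");
    for (int i = 0; i < npts; i++) fprintf(fp, " %d", points[i]);
    fprintf(fp, "\n");
    fclose(fp);
}

int main(void)
{
    int dims[] = { 2, 5, 9 };        /* a 3-dimensional maximal subspace (toy)   */
    int pts[]  = { 7, 8, 9, 10 };    /* dense points detected in it (toy)        */
    append_dense_unit(dims, 3, pts, 4);
    return 0;
}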
Algorithm 10 (denseUnitsInRange) is a modified version of Algorithm 6 (getDenseUnitsPivot) from the last chapter. The main difference is an additional check to keep signature sum values between LOW and HIGH. Also, instead of generating all dense units from a core-set and then filtering out those with signature sums outside a given LOW and HIGH range, the algorithm is optimised (steps 22-32 of Algorithm 10) with a condition check such that the core-set processing will stop when all of the next dense units generated from the core-set are expected to have signature sums greater than or equal to HIGH.
In Algorithm 10, seed, tempseed and U are initialized as empty arrays of size r each
and sums is initialized as an empty array of size r + 1.
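The early-stopping idea behind steps 22-32 can be illustrated with a much simpler recursive sketch; this is not the pivot-based formulation of Algorithm 10, and the key values, bounds and function names are made up for illustration. Because the keys are processed in ascending order, a partial combination can be abandoned as soon as even its cheapest completion would reach HIGH.

#include <stdint.h>
#include <stdio.h>

#define R 4   /* dense unit size = tau + 1 */

/* Enumerate all R-sized combinations of the (ascending) keys whose sums fall
 * in [low, high). */
static void combine(const uint64_t *keys, int c, int start, int depth,
                    uint64_t sum, int *unit, uint64_t low, uint64_t high)
{
    if (depth == R) {
        if (sum >= low) {                 /* sum < high is ensured by the pruning */
            for (int i = 0; i < R; i++)
                printf("%d ", unit[i]);
            printf("-> sum %llu\n", (unsigned long long)sum);
        }
        return;
    }
    for (int i = start; i <= c - (R - depth); i++) {
        uint64_t s = sum + keys[i];
        uint64_t min_rest = 0;            /* cheapest way to finish the unit */
        for (int j = 1; j < R - depth; j++)
            min_rest += keys[i + j];
        if (s + min_rest >= high)         /* later indices only get larger */
            break;
        unit[depth] = i;
        combine(keys, c, i + 1, depth + 1, s, unit, low, high);
    }
}

int main(void)
{
    uint64_t keys[] = { 11, 13, 17, 19, 23, 29 };   /* toy sorted core-set keys */
    int unit[R];
    combine(keys, 6, 0, 0, 0, unit, 60, 80);        /* keep sums in [60, 80)    */
    return 0;
}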
4.5 Experiments and analysis
We implemented the SUBSCALE algorithm in the Java language on an Intel Core i7-2600 desktop with 64-bit Windows 7 OS and 16GB RAM. The pedestrian data set was extracted from the attributed pedestrian database [77, 78] using the Matlab code given in APiS1.0 [79]. The madelon data set is available at the UCI repository [69]. Both of the datasets were normalised between 0 and 1 using WEKA [68] and contained no missing values. The source code of the scalable version of the SUBSCALE algorithm is available at [80].
We ran the SUBSCALE algorithm on the madelon dataset of 4400 points and 500 dimensions using different values of sp ranging from 1 to 500, and Figure 4.6 shows its effect on the runtime performance for two different values of ε. The darker line is for the larger value of ε and hence the higher runtime. But for both of the ε values, the execution time is almost proportional to the split factor after an initial threshold.
Input: DB of n × k points; Set K of n unique, random and large integers; ε; τ; Split factor sp
Output: Clusters: Set of maximal subspace clusters
1  CS^1, CS^2, . . . , CS^k ← FindCoreSets(DB)  /* Get core-sets in all k dimensions. */
   /* max(K) and min(K) are the maximum and minimum values of keys available in the key database K. */
2  SLICE ← ((max(K) − min(K)) × (τ + 1))/sp
3  LOW ← min(K) × (τ + 1)
4  HIGH ← LOW
5  for split ← 1 to sp do
6    Hashtable hTable ← {}  /* Initialise an empty hash table. */
7    LOW ← HIGH
8    HIGH ← LOW + SLICE
9    for j ← 1 to k do
10     CandidateNodes ← findSignaturesInRange(CS^j, LOW, HIGH)  /* Get candidate signature nodes from core-sets in CS^j. */
11     for each candidate Sig_x ∈ CandidateNodes do
12       if there exists a signature node Sig_y in hTable such that Sig_x.sum = Sig_y.sum then
13         Sig_y.subspace ← Sig_y.subspace ∪ Sig_x.subspace  /* ∪ is a union set-operator */
14       else
15         Add Sig_x to hTable
16       end
17     end
18   end
19   for all entries {Sig_x, Sig_y, . . . } ∈ hTable do
20     if Sig_x.subspace = Sig_y.subspace = . . . then
21       Add entry {Sig_x.U ∪ Sig_y.U ∪ . . . } to the |subspace|.dat file  /* ∪ is a union set-operator. |subspace| is the number of dimensions in the subspace. Each d.dat file contains maximal dense units in the relevant d-dimensional subspaces. */
22     end
23   end
24 end
25 Run any full-dimensional clustering algorithm, for example DBSCAN, on each entry of the d.dat files to output the maximal subspace Clusters
Algorithm 7: scalableSUBSCALE: Scalable version of the SUBSCALE algorithm
Input: DB of n × k points, τ and ε
Output: A collection of core-sets CS^1, CS^2, . . . , CS^k in all k dimensions
1  for j ← 1 to k do
2    Sort points P_1, P_2, . . . , P_n s.t. ∀ P_x, P_y ∈ DB, P_x^j ≤ P_y^j
3    last ← −1
4    x ← 1
5    for i ← 1 to n − 1 do
6      tempSet ← P_i
7      numNeighbours ← 1
8      next ← i + 1
9      while next ≤ n and P_next^j − P_i^j < ε do
10       Append point P_next to tempSet
11       Increment numNeighbours
12       Increment next
13     end
14     newLast ← lastElement in tempSet
15     if newLast ≠ last then
16       last ← newLast
17       if numNeighbours ≥ τ then
18         CS_x^j ← tempSet  /* CS_x^j is a core-set of dense points. */
19         Increment x
20       end
21     end
22   end
23 end
Algorithm 8: FindCoreSets: Find core-sets in the given dataset.
Input: Core-sets CS^j from the j-th dimension, LOW, HIGH
/* For readability, we drop the dimension index j from CS */
Output: Set of candidate signature nodes: CandidateNodes
1  lastElement ← −1
2  for i ← 1 to |CS| do  /* |CS| is the number of core-sets */
3    pivot ← indexOf(lastElement) in CS_i
4    if pivot ≤ τ then
5      DenseUnits ← denseUnitsInRange(|CS_i|, τ + 1)
6    else
7      Split CS_i into CS_i1 and CS_i2 such that CS_i1 contains the first 1 . . . p points and CS_i2 contains the rest of the points
8      if |CS_i2| > τ then
9        DenseUnits ← denseUnitsInRange(|CS_i2|, τ + 1)
10       select ← τ
11     else
12       select ← |CS_i2|
13     end
14     count ← 1
15     do
16       partial_1 ← partialDenseUnitsInRange(|CS_i1|, τ + 1 − count)
17       partial_2 ← partialDenseUnitsInRange(|CS_i2|, count)
18       for p ← 1 to |partial_1| do
19         for q ← 1 to |partial_2| do
20           if LOW ≤ findSum(partial_1[p]) + findSum(partial_2[q]) < HIGH then
21             Merge both dense units partial_1[p] and partial_2[q] and add to the set DenseUnits
22           end
23         end
24       end
25       Increment count
26     while count ≤ select
27   end
28 end
29 for each dense unit U ∈ DenseUnits do
30   sum ← findSum(U, K)
31   subspace ← j
32   Add signature {sum, U, subspace} to CandidateNodes
33 end
Algorithm 9: findSignaturesInRange: Find candidate signature nodes in a given core-set with signature sum between LOW and HIGH.
Input: Core-set CS of size c; r; key database K; LOW; HIGH. A complete dense unit from CS is of size τ + 1.
Output: DenseUnits: A set of dense units, each of size r.
1  for i ← 1 to c do
2    localKeys[i] ← M(CS[i] ↦ K)
3  end
4  Sort localKeys in ascending order
5  for i ← 2 to r do
6    seed[i] ← c − r + i
7  end
8  seed[1] ← 0
9  while true do
10   i ← r
11   while i > 0 and seed[i] = c − r + i do
12     Decrement i  /* Get the active position. */
13   end
14   if i = 0 then
15     break  /* All combinations have been generated. */
16   else
17     temp ← seed[i]  /* Get seed element. */
18     for j ← i to r do
19       k ← temp + 1 + j − i
20       tempseed[j] ← k
21       tempsum ← sums[j] + localKeys[temp]
22       if tempsum ≥ HIGH then
23         flag ← true  /* Skip the rest of the computations */
24         while (j > 2) and ((tempseed[j] − tempseed[j − 1]) < 2) do
25           Decrement j
26         end
27         while j ≤ r do
28           tempseed[j] ← c − r + j
29           Increment j  /* Reset the seed */
30         end
31         break
32       end
33       sums[j + 1] ← tempsum
34       U[j] ← M(K[localKeys[temp]] ↦ DB)
35     end
36     seed ← tempseed
37     flag ← false  /* Go to the next iteration */
38     if findSum(U) ≥ LOW then
39       Copy dense unit U to the output set DenseUnits
40     end
41   end
42 end
Algorithm 10: denseUnitsInRange: Find all combination dense units of size r from a core-set CS such that the signature sum of each dense unit is less than HIGH but greater than or equal to LOW.
Input: Core-set CS of size c; r; key database K; HIGH. A complete dense unit from CS is of size τ + 1.
Output: DenseUnits: A set of dense units, each of size r.
1 This algorithm is similar to Algorithm 10 (denseUnitsInRange) except for the last step. No check is made for findSum(U) ≥ LOW.
2 The dense unit U is simply copied to the output set DenseUnits.
Algorithm 11: partialDenseUnitsInRange: Find all combination dense units of size less than r from a core-set CS such that the signature sum of each dense unit is less than HIGH.
Figure 4.6: Runtime vs split factor for madelon dataset. The execution time is almost
proportional to the split factor after an initial threshold.
The size of each hTable_i can be adjusted to handle large databases of points by choosing an appropriate split factor sp. Instead of generating all dense units in all dimensions sp times, if we sort the keys of the density-connected points, we can stop the computation of dense units when no more signature sums below the upper limit of hTable_i are possible.
We successfully ran this modified SUBSCALE algorithm for the 3661 × 6144 pedestrian dataset. We used ε = 0.000001, τ = 3, minSize = 4 and sp = 4000, and it took 336 hours to finish and compute dense units in 350 million subspaces.
We encountered memory overflow problems when handling a large number of subspace clusters, due to the increasing size of the Clusters data structure used in Algorithm 2. We found a solution by distributing the dense points of each identified maximal subspace from the hash table to a relevant file on the secondary storage. The relevance is determined by the cardinality of the found subspace. If a set of points is found in a maximal subspace of cardinality m, we can store these dense points in a file named 'm.dat' along with their relevant subspaces. This also facilitates running any full-dimensional algorithm like DBSCAN on each of these files, as its parameters can be set according to the different dimensionalities.
4.6 Summary
The generation of large and high-dimensional data in recent years has overwhelmed the data mining community. In this chapter, we have presented a scalable version of the SUBSCALE algorithm proposed in the previous chapter. The scalable version has performed far better when it comes to handling high-dimensional datasets. We have experimented with up to 6144-dimensional data and we can safely claim that it will work for larger datasets too by adjusting the split factor.
However, the main cost in the scalable SUBSCALE algorithm is the computation of
the candidate 1-dimensional dense units. In addition to splitting the hash table computation, the SUBSCALE algorithm has a high degree of parallelism as there is no dependency in computing dense units across multiple dimensions. We exploit the parallelism in the algorithm structure of the SUBSCALE algorithm in the next chapter using an OpenMP-based shared-memory architecture.
Chapter 5
Parallelization
5.1 Introduction
The growing size and dimensions of data these days have set new challenges for the data mining research community [5]. Clustering is a data mining process of grouping similar data points into clusters without any prior knowledge of the underlying data distribution [46]. As discussed in chapter 2, the traditional clustering algorithms either attempt to partition a given data set into a predefined number of clusters or use the full-dimensional space to cluster the data [47]. However, these techniques are unable to find all hidden clusters, especially in high-dimensional data. The increase in the number of dimensions of the data impedes the performance of these clustering algorithms, which are known to perform very well with low dimensions.
As discussed in the previous chapters, data group together differently under different subsets of dimensions, called subspaces. A set of points can form a cluster in a particular subspace and can be part of different clusters, or may not participate in any cluster, in other subspaces. Thus, it becomes imperative to find all hidden clusters in these subspaces. The subspace clustering algorithms form a branch of clustering algorithms which attempt to find clusters in all possible subsets of dimensions of a given data set [16, 49].
Usually distance or density among the points is used to measure the similarity. Given
an n × k dataset, a data point is a k-dimensional vector with values measured against
each of the k dimensions. Two data points are said to be similar in a given subset of
dimensions (subspace) if the values of these points under each dimension participating in
this subspace are similar as per the similarity criteria.
A k-dimensional data set can have up to 2^k − 1 possible axis-parallel subspaces, therefore the search space for subspace clustering becomes exponential in the number of dimensions. Subspace clustering is a computationally very expensive process. Most of the relevant algorithms are inefficient as well as ineffective for high-dimensional data sets.
With the wider availability of multi-core processors these days, parallelization seems
to be an obvious choice to reduce this computational cost. There has been some work
in the literature for parallel algorithms in subspace clustering. But earlier subspace clustering algorithms have less obvious parallel structures. This is partially due to the data
dependence during the processing sequence.
In chapter 3, we proposed the SUBSCALE algorithm, which is a promising approach to find the non-trivial subspace clusters without enumerating the data points [81]. This algorithm requires only k database scans for k-dimensional data. In Chapter 4, we proposed the scalable version of the SUBSCALE algorithm. The widespread availability of multi-core processors has fuelled our endeavour to parallelize the SUBSCALE algorithm and further reduce its time complexity. In this chapter, we present the modifications in the SUBSCALE algorithm to utilise multiple threads through the OpenMP framework [82].
The SUBSCALE algorithm first generates the dense sets of points across all 1-dimensional subspaces and then efficiently combines them to find the non-trivial subspace clusters. The non-trivial subspace clusters are also called maximal clusters. If a set of data points forms a cluster C in a particular subspace of d dimensions, where k is the total number of dimensions and d ≤ k, then this cluster will exist in all of the 2^d subsets of this subspace [14]. Although the SUBSCALE algorithm does not generate any trivial subspace clusters, it is still compute intensive due to the generation of the combinatorial 1-dimensional dense
set of points. However, the compute time can be reduced by parallelizing the computation
of the dense units.
In this chapter, we focus on the scalable SUBSCALE subspace clustering algorithm due to the computational independence in the structure of this algorithm. We aim to utilise the multi-core architecture to accelerate the SUBSCALE algorithm while providing the same output as the sequential version. We investigate the runtime performance with up to 48 cores running in parallel. The experimental evaluation demonstrates a speedup of up to a factor of 27. Our modified algorithm is faster and scalable for high-dimensional large data sets.
In the next section we discuss some of the related literature. Section 5.3 gives the background of the SUBSCALE algorithm and our approach. In section 5.4, we analyse the performance of the parallel implementation and, finally, the chapter is summarized in section 5.5.
5.2 Related work
Over the past few years, there has been extensive research in the clustering algorithms
[5, 36, 46].
One of the popular techniques to deal with high dimensionality is to reduce the number of dimensions by removing the irrelevant (or less relevant) dimensions; for example, Principal Component Analysis (PCA) transforms the original high-dimensional space into a low-dimensional space [83]. Since PCA preserves the original variance of the full-dimensional data during this transformation, if no cluster structure was detected in the original dimensions, no new clusters will be found in the transformed dimensions. Also, the transformed dimensions lack intuitive meaning, as it is difficult to interpret the clusters found in the new dimensions in relation to the original data space. The significance of local relevance among the data with respect to subsets of dimensions has led to the advent of subspace clustering algorithms [16, 49].
The projected clustering algorithms like PROCLUS [13] and FINDIT [56] require
users to input the number of clusters and the number of subspaces, which is difficult to
estimate for the real data sets. Hence, these algorithms are essentially data partitioning
techniques and cannot discover the hidden subspace clusters in the data.
The algorithms based on the full-dimensional space like DBSCAN [10] are also ineffective for high-dimensional data sets due to the curse of dimensionality [81]. According to the DBSCAN algorithm, a point is dense if it has τ or more points in its ε-neighbourhood, and a cluster is defined as a set of such dense (similar) points. Two points are said to be in the same neighbourhood (similar) if the values under each of the corresponding dimensions lie within ε distance. As mentioned in the previous section, a data point is a vector in a k-dimensional space. But, for high-dimensional data, the clusters exist in the subspaces of the data, as two points might be similar in a certain subset of dimensions but may be totally unrelated (or distant) in another subset of dimensions. The underlying premise that data group together differently under different subsets of dimensions opened the challenging domain of the subspace clustering algorithms [16, 49, 50].
Agrawal et al. [15] were the first to introduce the grid-density based subspace clustering approach in their famous CLIQUE algorithm. The data space is partitioned into equal-sized 1-dimensional ξ units using a fixed-size grid. A unit is considered dense if the number of points in it exceeds the density support threshold, τ. A subspace with k1 dimensions participating in it is called higher-dimensional than another subspace with k2 dimensions in it if k1 > k2. The lower-dimensional candidate dense units are combined together iteratively to compute higher-dimensional dense units (clusters), starting from the 1-dimensional units. There are many other variations of this algorithm, e.g., using entropy [61] and an adaptive grid [60]. Instead of using a grid, the SUBCLU algorithm applied DBSCAN on each of the candidate subspaces [58], where DBSCAN is a full-dimensional clustering algorithm. The INSCY algorithm [59] is an extension of SUBCLU which uses indexing to compute and merge 1-dimensional base clusters to find the non-trivial subspace clusters.
Although all of these subspace clustering algorithms can detect previously unknown subspace clusters, they fail for high-dimensional data sets. The inefficiency arises due to the
detection of redundant trivial clusters and an excessive number of database scans during
the clustering process.
Subspace clustering is a compute-intensive task and parallelization seems to be an obvious choice to reduce this computational cost. But most of the subspace clustering
algorithms have less obvious parallel structures [15, 58]. This is partially due to the data
dependency during the processing sequence [84].
The SUBSCALE algorithm introduced in the previous chapters requires only k database scans to process a k-dimensional dataset. Also, this algorithm is scalable with the dimensions and, unlike the existing algorithms, does not compute the trivial clusters.
Even though this algorithm does not generate any trivial subspace clusters, its time complexity is still compute intensive due to the generation of the combinatorial 1-dimensional
dense set of points. The compute time can be reduced by computing these 1-dimensional
dense units in parallel.
In the next section, we briefly discuss the SUBSCALE algorithm and our modifications for parallel implementation.
5.3 Parallel subspace clustering
The increasing availability of multi-core processors these days furthers the expectations for efficient clustering of high-dimensional data. Parallel processing of data can help to speed up the execution time by sharing the processing load amongst multiple threads running on multiple processors or cores. However, the sequential process should be decomposed into independent units of execution so as to be distributed among multiple threads running on separate cores or processors. Also, the management of threads from their generation to termination, including inter-communication and synchronisation, makes parallel processing a complex task.
In this section, we discuss the parallel implementation of subspace clustering using multi-core architectures in detail. We extend the work done in the previous chapters on the SUBSCALE algorithm. We aim to further reduce the execution time by parallelizing the compute-intensive part of the SUBSCALE algorithm.
Before presenting our approach, some basic definitions and concepts are given below:
Definitions
Let DB be a database of n × k points where DB : {P_1, P_2, . . . , P_n}. Each point P_i is a k-dimensional vector {P_i^1, P_i^2, . . . , P_i^k} such that P_i^d is the projection of a point P_i in the d-th dimension. A point refers to a data point from the dataset.
A subspace is a subset of the dimensions. For example, S : {r, s} is a 2-dimensional subspace consisting of the r-th and s-th dimensions, and the projection of a point P_i in this subspace is {P_i^r, P_i^s}. The dimensionality of a subspace refers to the total number of dimensions in it. A single dimension can be referred to as a 1-dimensional subspace. A
subspace with dimensionality a is a higher-dimensional subspace than another subspace
with dimensionality b, if a > b. Also, a subspace S 0 with dimensionality b is a projection
of another subspace S of dimensionality a, if a > b and S 0 ⊂ S, that is, all the dimensions
participating in S 0 are also contained in the subspace S.
A subspace cluster C_i = (P, S) is a set of points P such that the projections of these points in subspace S are dense. A cluster C_i = (P, S) is called a maximal subspace cluster if there is no other cluster C_j = (P, S′) such that S′ ⊃ S. According to the Apriori principle [14], it is sufficient to find only the maximal subspace clusters rather than all clusters in all possible subspaces. The reason behind this sufficiency is that a dense set of points in a higher-dimensional subspace is dense in all of its lower-dimensional projections. The lower-dimensional projections of a maximal cluster contain redundant information.
Next, we give an overview of the SUBSCALE algorithm and also, highlight the research problem.
5.3.1 SUBSCALE algorithm
SUBSCALE is a clustering algorithm to find maximal subspace clusters without generating the trivial lower-dimensional clusters. The projections of the dense points in a maximal subspace cluster will be dense in all single dimensions participating in this subspace. The main idea behind the SUBSCALE algorithm is to find the dense sets of points (density chunks) in all of the k single dimensions, generate the relevant signatures from these density chunks, and collide them in a hash table (hTable) to directly compute the maximal subspace clusters. Algorithm 12 outlines these processing steps. We now briefly describe the process of finding density chunks and the corresponding signatures.
Input: DB : n × k data, a set of n keys: K
Output: Dense points in maximal subspaces
1 Initialize a common hash table hTable
2 for dimension j ← 1 to k do
3   Scan {P_1^j, P_2^j, . . . , P_n^j} and find density chunks
4   for each density chunk do
5     create signatures and hash them to hTable
6   end
7 end
8 Collect all collisions from hTable to output dense points in maximal subspaces
Algorithm 12: SUBSCALE algorithm in brief
Density chunks
The SUBSCALE algorithm uses a distance-based (L1 metric) similarity measure to define the density among the data points. Based on two user-defined parameters ε and τ, a data point is dense if it has at least τ points within ε distance. The neighbourhood N_ε(P_i) of a point P_i in a particular dimension d is the set of all points P_j, P_i ≠ P_j, such that L1(P_i^d, P_j^d) < ε. Each dense point, along with its neighbours, forms a density chunk such that each member of this chunk is within ε distance of every other member.
Figure 5.1: Figure adapted from [81] illustrates the lack of information about which 1-dimensional clusters (dense units) will generate the maximal clusters.
The smallest possible dense set of points is of size τ + 1, known as a dense unit. In a particular dimension, a density chunk of size t will generate C(t, τ + 1) possible combinations of points to form dense units. Some of these dense units may or may not contain projections of higher-dimensional maximal subspace clusters. As we do not have prior information about the underlying data distribution, it is not possible to know in advance which of these dense units are significant. The only possibility is to check which of these dense units from different dimensions contain identical points. As shown in Figure 5.1, the projections of the points {P7, P8, P9, P10} in dimension d2 form a 1-dimensional cluster (Cluster3), but there is no 1-dimensional cluster in dimension d1 with the same points as Cluster3; hence there is no cluster in the subspace {d1, d2} with the points {P7, P8, P9, P10}.
Signatures
To create signatures from the dense units, n random and unique keys made of integers
with large digits are chosen to create a pool of keys called K. Each of the n data points is mapped to a key from the key database K on a 1:1 basis. The sum of the mapped keys of the data points in each dense unit is termed its signature.
Figure 5.2: Each signature node corresponds to a dense unit and consists of the sum of the keys in this dense unit, the data points contained in the dense unit and the dimensions in which this dense unit exists.
According to observations 2 and 3 in chapter 3, two dense units with equal signatures would have identical points in them. Thus, collisions of the signatures across dimensions d_r, . . . , d_s imply that the corresponding dense unit exists in the maximal subspace S = {d_r, . . . , d_s}. We refer our readers to chapter 3 for the detailed explanation and the proof of this concept. Each single dimension may have zero or more density chunks, which in turn generate a different number of signatures in each dimension. Some of these signatures will collide with the signatures from the other dimensions to give a set of dense points in the maximal subspace.
Hashing of signatures
The SUBSCALE algorithm uses an hTable data structure, similar to a hash table, to compute the dense units in the maximal subspaces. The hTable is a simple storage mechanism to store the information regarding the signatures of the dense units generated across single dimensions. A signature node, SigNode, is used to store the information pertaining to each dense unit (Figure 5.2). Each SigNode contains: the sum of the keys of the corresponding dense unit, the data points in the dense unit, the dimensions in which this dense unit was computed and the pointer to the next signature node, if any.
Figure 5.3 shows the hash table used in this chapter. The hTable consists of a fixed number of slots (numSlots) and each slot can store one or more signature nodes. In this chapter, we used a modulo function to assign a slot to a SigNode. Two or more signature nodes with different sums may be allotted the same slot in the hTable if the modulo output is the same for these nodes. A linked list can be used to store more than one SigNode at a slot. An hTable is thus a collection of signature nodes. Two dense units are said to be colliding if they have equal signatures, that is, the same sum value in their signature nodes (Figure 5.2). When two dense units collide, an additional dimension is appended to the SigNode.
Figure 5.3: hTable data structure in SUBSCALE to store signatures and their associated data from multiple dimensions. numSlots is the number of total slots available in the hTable. Each slot may have 0 or more signature nodes stored in it.
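The following C sketch mirrors the structure just described: numSlots buckets, a modulo function on the signature sum to choose a slot, chained SigNodes within a slot, and an append of the extra dimension when two equal sums collide. The field names, sizes and the MAX_DIMS bound are illustrative assumptions, not the thesis code.

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define NUM_SLOTS 1024      /* numSlots: kept tiny for the sketch */
#define TAU 3
#define MAX_DIMS 64         /* illustrative bound on a subspace size */

typedef struct sig_node {
    uint64_t sum;               /* signature of the dense unit             */
    int points[TAU + 1];        /* point ids of the dense unit             */
    int dims[MAX_DIMS];         /* dimensions in which this unit was found */
    int ndims;
    struct sig_node *next;      /* chaining of nodes that share a slot     */
} sig_node;

static sig_node *htable[NUM_SLOTS];

/* Hash a dense unit found in dimension `dim`; on an equal sum, only the new
 * dimension is appended to the existing node. */
static void hash_signature(uint64_t sum, const int *points, int dim)
{
    size_t slot = (size_t)(sum % NUM_SLOTS);    /* modulo slot assignment */
    for (sig_node *n = htable[slot]; n; n = n->next) {
        if (n->sum == sum) {
            if (n->ndims < MAX_DIMS)
                n->dims[n->ndims++] = dim;
            return;
        }
    }
    sig_node *n = malloc(sizeof *n);
    n->sum = sum;
    for (int i = 0; i <= TAU; i++)
        n->points[i] = points[i];
    n->dims[0] = dim;
    n->ndims = 1;
    n->next = htable[slot];
    htable[slot] = n;
}

int main(void)
{
    int unit[TAU + 1] = { 7, 8, 9, 10 };
    hash_signature(123456789ULL, unit, 3);   /* unit seen in dimension 3      */
    hash_signature(123456789ULL, unit, 5);   /* same unit seen in dimension 5 */
    printf("dims recorded for the unit: %d\n",
           htable[123456789ULL % NUM_SLOTS]->ndims);
    return 0;
}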
Memory and runtime cost
The value of numSlots depends on the number of dense units being generated from each density chunk, which in turn depends on the number of density chunks in each dimension. We do not have any prior information about the underlying data distribution or density. Even though the total number of dense units can be calculated by first creating the density chunks in all of the single dimensions through the formula given above, we do not know in advance which of these dense units will collide and which will not. When two signature nodes collide, we do not store the second node again; just an additional dimension is appended to the first node.
To find the maximal subspace clusters, all possible dense units in all of the single
dimensions need to be generated. The storage requirement for the total signatures generated from a data set can outgrow the available memory in the system to store the hTable. This adds to the time and memory costs of the SUBSCALE algorithm. The sequential version of the SUBSCALE algorithm proposed splitting of the hash
table computations to overcome this memory constraint.
The size of each dense unit is τ + 1. If K is the key database of n large integers, then the value of a signature generated from a dense unit would lie approximately within the range R = [(τ + 1) × min(K), (τ + 1) × max(K)], where min(K) and max(K) are the smallest and the largest keys respectively. Also, if numSig^d is the number of total signatures in a dimension d, then the total number of signatures in a k-dimensional data set will be totalSignatures = Σ_{d=1}^{k} numSig^d. If memory is not a constraint, then a hash table with R slots can easily accommodate the total signatures as, typically, totalSignatures ≪ R.
Since memory is a constraint, the SUBSCALE algorithm splits this range R into multiple slices such that each slice can be processed independently using a separate and much smaller hash table. The computations for each slice are not dependent on the other slices. The split factor, called sp, determines the number of splits of R and its value can be set according to the available working memory.
Thus, the computations of dense units in each single dimension as well as in each single slice can be processed independently of the others. In the next section, we endeavour to use this independence among dense units to reduce the execution time of the SUBSCALE algorithm with multiple cores.
5.3.2 Parallelization using OpenMP
In the previous subsection, we briefly investigated the internal working of the SUBSCALE algorithm and identified a few areas which can be processed independently. Next, we discuss how OpenMP threads on multi-core architectures can be used to exploit the parallelism in the SUBSCALE algorithm.
OpenMP
The rapid increase in multi-core processor architectures these days has stretched the boundaries of computing performance. Using the multi-threaded OpenMP platform, we can leverage these multiple cores for parallel processing of data and instructions. OpenMP is a set of compiler directives and callable runtime library routines to facilitate shared-memory parallelism [82]. The #pragma omp parallel directive is used to tell the compiler that a block of code should be executed by multiple threads in parallel. We used OpenMP with C and re-implemented the SUBSCALE algorithm to parallelize the code using OpenMP directives.
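A minimal, self-contained illustration of the directive is given below (compiled with, e.g., gcc -fopenmp); it only shows how loop iterations are shared among threads and is not part of the thesis code. The loop bound k is an arbitrary stand-in for the number of dimensions.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int k = 8;                          /* e.g. the number of dimensions */
    #pragma omp parallel for
    for (int j = 0; j < k; j++) {
        /* each iteration is handled by one of the threads in the team */
        printf("dimension %d processed by thread %d\n", j, omp_get_thread_num());
    }
    return 0;
}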
Dimensions in parallel
The generation of signatures from the density chunks in each single dimension is independent of the other dimensions. This observation makes Step 2 of Algorithm 12 a strong candidate for parallelization. As shown in Figure 5.4, we can divide the dimensions among the threads such that each thread will process its share of 1-dimensional data points to compute the signatures. Each thread runs on a separate processing core, and dimensions can be distributed equally or unequally among the threads. If t is the number of threads being used, then t out of the total k dimensions can be processed in parallel, assuming t < k.
A hash table hT able is shared among the threads to store the signatures as soon as
they are generated (Figure 5.3). The information about the collisions among signatures
from the different dimensions is also stored in this hash table. Algorithm 12 can be modified to process the dimensions in parallel as in Algorithm 13. Heuristics can be used to fix the number of slots in the hTable.
However, the problem with this setup is that every time a thread accesses the shared hTable to hash a signature, it has to take exclusive control of the required memory slot. Without mutually exclusive access, two threads with the same signatures generated from two different dimensions would overwrite the same slot of the hTable. The overwriting would lead to a loss of information on the maximal subspace related to this signature. The maximal subspace of a dense unit can only be found by having the information about which dimensions generated this dense unit.
Figure 5.4: Parallel processing of the SUBSCALE algorithm. Each dimension is allocated a separate thread and each thread computes the density chunks and its signatures independently of the other threads.
OpenMP provides a lock mechanism for shared variables, but its synchronisation adds overhead as well as thread contention. When dimensions are being processed in parallel, a large number of combinatorial signatures will be generated. The number of signatures being mapped to the same slot will depend on their sum values, the numSlots of the hTable and the hashing function (modulo in this chapter) being used. A smaller hash table would lead to frequent requests for exclusive access to the same slot from different threads. It can be argued that a large hash table would result in a decrease in lock contention, but then the number of locks to be maintained would grow proportionally to the number of slots in the hash table. Also, the allowed total size of a hash table depends on the available working memory. We discuss the results from this method in section 5.4.
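One way to realise the mutually exclusive access discussed above is a lock per hash-table slot, sketched below with OpenMP's lock routines; the per-slot counter stands in for the real insert-or-merge work, and everything here is an illustrative assumption rather than the thesis implementation.

#include <omp.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_SLOTS 4096

static omp_lock_t slot_lock[NUM_SLOTS];
static int slot_count[NUM_SLOTS];        /* stand-in for the per-slot node chains */

static void hash_with_lock(uint64_t sum)
{
    int slot = (int)(sum % NUM_SLOTS);
    omp_set_lock(&slot_lock[slot]);      /* exclusive access to this slot only */
    slot_count[slot]++;                  /* real code would insert/merge a node */
    omp_unset_lock(&slot_lock[slot]);    /* other slots remain free for peers   */
}

int main(void)
{
    for (int i = 0; i < NUM_SLOTS; i++)
        omp_init_lock(&slot_lock[i]);

    #pragma omp parallel for
    for (int j = 0; j < 100000; j++)
        hash_with_lock((uint64_t)j * 1315423911ULL);

    int total = 0;
    for (int i = 0; i < NUM_SLOTS; i++)
        total += slot_count[i];
    printf("hashed %d signatures\n", total);
    return 0;
}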
Slices in parallel
The sharing of a common hash table among threads is a bottleneck for the speed up
expected through parallel processing of dimensions. The signatures generated from all of
the dimensions need to be processed in order to identify which of these collide. Without a
shared data structure to hold these signatures, we would not know which other dimensions
90
Chapter 5. Parallelization
Input: DB: n × k data, a set of n keys: K
Output: Dense points in maximal subspaces
Initialize a common hash table hTable
numThreads ← k
#pragma omp parallel num_threads(numThreads) shared(DB, hTable, K)
{
    #pragma omp for
    for dimension j ← 1 to k do
        Scan {P1j, P2j, . . . , Pnj} and find density chunks
        for each density chunk do
            create signatures and hash them to hTable in a mutually exclusive way
        end
    end
}
Collect all collisions from hTable to output dense points in maximal subspaces

Algorithm 13: Modified SUBSCALE algorithm to execute multiple dimensions on multiple cores with a shared hash table
Input: DB: n × k data, a set of n keys: K, SP
Output: Dense points in maximal subspaces
numThreads ← k
R ← (max(K) − min(K)) × (τ + 1)
SLICE ← R / SP
#pragma omp parallel num_threads(numThreads) shared(DB, K) private(LOW, HIGH)
{
    #pragma omp for
    for split ← 0 to SP − 1 do
        Initialize a new hash table hTable
        LOW ← min(K) × (τ + 1) + split × SLICE
        HIGH ← LOW + SLICE
        for dimension j ← 1 to k do
            Scan {P1j, P2j, . . . , Pnj} and find density chunks
            for each density chunk do
                create signatures between LOW and HIGH and hash them to hTable
            end
        end
        Collect all collisions from hTable to output dense points in maximal subspaces
        Discard hTable
    end
}

Algorithm 14: Modified SUBSCALE algorithm to execute multiple slices on multiple cores with a separate hTable per slice
are generating the same signature. Although we cannot split the hash table, we can split the generation of signatures so that only sums within a certain range are allowed in the hash table at a time.
As discussed in the previous chapter, the SUBSCALE algorithm proposed splitting the range R of expected signature values among multiple slices. Since these slices can be processed independently of each other, multiple threads can process them in parallel as in Algorithm 14. Each slice requires a separate hash table. Although this approach helps achieve faster clustering performance from the SUBSCALE algorithm, the memory required to store all of the hash tables can still be a constraint. Since R denotes the whole range of sums that are expected during the signature generation process, we can bring these slices into the main working memory one by one. Each slice is again split into sub-slices to be processed with multiple threads, as explained in Algorithm 15.
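A minimal C sketch of the slice-parallel strategy of Algorithm 14 is given below. The helper routines new_hash_table(), create_signatures(), collect_collisions() and free_hash_table() are hypothetical placeholders for the SUBSCALE steps, and 64-bit integer keys are assumed; only the range splitting and the per-slice private hash table reflect the approach described above.

#include <omp.h>
#include <stdint.h>

/* Hypothetical SUBSCALE helpers, assumed to be implemented elsewhere. */
typedef struct HashTable HashTable;
extern HashTable *new_hash_table(long numSlots);
extern void create_signatures(int dim, int64_t low, int64_t high, HashTable *ht);
extern void collect_collisions(const HashTable *ht);
extern void free_hash_table(HashTable *ht);

/* Each slice of the signature range gets its own private hash table, so no
   locking is needed. */
void cluster_slices(int64_t minKey, int64_t maxKey, int tau, int sp,
                    int k, long slotsPerSlice)
{
    int64_t R = (maxKey - minKey) * (tau + 1);
    int64_t slice = R / sp;

    #pragma omp parallel for schedule(dynamic)
    for (int split = 0; split < sp; split++) {
        int64_t low  = minKey * (tau + 1) + (int64_t)split * slice;
        int64_t high = low + slice;

        HashTable *ht = new_hash_table(slotsPerSlice);
        for (int j = 0; j < k; j++)              /* all single dimensions      */
            create_signatures(j, low, high, ht); /* only sums in [low, high)   */
        collect_collisions(ht);                  /* dense points in maximal subspaces */
        free_hash_table(ht);
    }
}

The schedule(dynamic) clause lets idle cores pick up the remaining slices, which matters because, as shown later in Figure 5.8, different slices can produce very different numbers of signatures.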
The results and their evaluation are discussed in Section 5.4.
5.4 Results and Analysis

5.4.1 Experimental setup
The experiments were carried out on an IBM SoftLayer server with 48 cores and 128 GB RAM, running Ubuntu 15.04. Hyper-threading was disabled on the server so that each thread could run on a separate physical core and the parallel performance could be measured fairly. The parallel version of the SUBSCALE algorithm was implemented in C using OpenMP directives. The #pragma omp parallel and #pragma omp for directives were used to allocate work to the multiple cores. Also, we used 14-digit non-negative integers for the key database.
Input: DB: n × k data, a set of n keys: K, SP, innerSP
Output: Dense points in maximal subspaces
numThreads ← k
R ← (max(K) − min(K)) × (τ + 1)
SLICE ← R / SP
for split ← 0 to SP − 1 do
    LOW ← min(K) × (τ + 1) + split × SLICE
    innerSLICE ← SLICE / innerSP
    #pragma omp parallel num_threads(numThreads) shared(DB, K) firstprivate(LOW)
    {
        #pragma omp for
        for innersplit ← 0 to innerSP − 1 do
            Initialize a new hash table hTable
            innerLOW ← LOW + innersplit × innerSLICE
            innerHIGH ← innerLOW + innerSLICE
            for dimension j ← 1 to k do
                Scan {P1j, P2j, . . . , Pnj} and find density chunks
                for each density chunk do
                    create signatures between innerLOW and innerHIGH and hash them to hTable
                end
            end
            Collect all collisions from hTable to output dense points in maximal subspaces
            Discard hTable
        end
    }
end

Algorithm 15: Modified SUBSCALE algorithm to execute multiple sub-slices on multiple cores
5.4.2 Data Sets
Synthetic datasets may contain an inherent bias from the underlying data distribution; therefore, we used real datasets for our clustering experiments. The two main datasets for this experiment, the 4400 × 500 madelon dataset and the 3661 × 6144 pedestrian dataset, are publicly available. The madelon data is available at the UCI repository [69] and the pedestrian dataset was created from the attributed pedestrian database [77, 78] using the Matlab code given in APiS1.0 [79].
5.4.3 Speedup with multiple cores
We compared the runtime performance of the modified SUBSCALE algorithm using multiple threads running in parallel on up to 48 cores. The first attempt was to compute the dense units in all single dimensions in parallel.
Multiple cores for dimensions
We used the 500-dimensional madelon dataset with ε = 0.000001 and τ = 3. With these parameters, the total number of signatures from all of the single dimensions in the madelon dataset was calculated to be 29350693. The total number of signatures can be pre-calculated from the dense chunks in all dimensions. Some of these signatures will collide in the common hash table hTable, shared among the threads. As discussed in the previous subsection, the shared hTable will eventually lead to memory contention whenever multiple threads try to access the same slot of hTable simultaneously. Since the frequency of this contention depends upon the number of slots in hTable, we experimented with three different numbers of slots in the shared hTable: 0.1 million, 0.5 million and 1 million.
Figure 5.5 shows the runtime performance for the madelon dataset when using multiple threads for the dimensions. We can see that the performance improves slightly by processing dimensions in parallel but, as discussed before, the mutually exclusive access to the same slot of the shared hash table results in performance degradation.
Figure 5.5: Dataset: madelon: 4400 × 500. Parameters: ε = 0.000001 and τ = 3. Total dimensions are distributed among threads and these threads run in parallel on separate cores. Each thread computes the density chunks and its signatures independent of other threads. One dimension per thread is processed at a time. Runtime measured with respect to the number of threads and the number of slots in hTable.
Multiple cores for slices
The next step is to avoid this memory contention, which arises due to the simultaneous access of the same slot of the shared hash table. This happens when signatures with the same sum value, or signatures with different sum values but the same hash output, are generated simultaneously by different threads running on different dimensions. If the threads could generate signatures requiring different slots at all times, this memory contention could be avoided.
We re-implemented the scalable version of the SUBSCALE algorithm using OpenMP threads. But instead of running threads on dimensions, we ran them on slices created using the split factor discussed before. This implementation does not require the use of a lock mechanism for shared access to memory. As discussed in the previous chapter, R/sp was used to approximate the numSlots value for each hTable.
Figure 5.6 shows the runtime versus the number of threads used for processing the slices of the madelon dataset. We used the same values of ε and τ as for the shared hTable with the lock mechanism above. The hash computation was sliced with different values of the split factor sp, ranging between 200 and 2000. These slices of the hash computation were divided among multiple cores to be run by separate threads in parallel.

Figure 5.6: Dataset: madelon: 4400 × 500. Parameters: ε = 0.000001 and τ = 3. The slices are distributed among threads and these threads run in parallel on separate cores. Each thread computes the density chunks and its signatures independently of the other threads. One slice per thread is processed at a time. Runtime is measured with respect to the number of threads and the split factor sp. The overhead of using threads surpasses the performance gain when only 4 slices are being processed by each core.
We can see the performance boost from using a larger number of threads. The speedup is significant when there are more slices to be processed. Hence, multiple cores can reduce the runtime significantly when more work needs to be done with a large value of sp. The speedup for the same experiment is shown in Figure 5.7, and it becomes linear as the number of slices increases.
Scalability with the dimensions
Motivated by the results for the madelon dataset, we experimented with the 6144-dimensional pedestrian dataset to study scalability and speedup with a higher number of dimensions. Using parameters ε = 0.000001 and τ = 3, a total of 19860542724 signatures in all single dimensions are expected from the pedestrian dataset.
Figure 5.7: Speedup for the results given in Figure 5.6. As the number of slices increases, the efficiency gain from multi-core architectures increases. With sp=200, the number of slices per core can vary from 200 to 4, depending upon the number of threads.
Each entry in the hash table stores the τ + 1 dense points (say 16 bytes for τ = 3 on a typical computer), the value of a large-digit signature to be matched (8 bytes) and the dimensions being collided (16 bytes for an average 2-dimensional subspace). The total memory required to store an entry would therefore be approximately 40 bytes. Hence, the 19860542724 expected signatures would require ∼ 592 GB of working memory to store the hash tables. There would be additional memory requirements for the temporary data structures used during the computation process.
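The per-entry estimate above could correspond to a layout such as the following sketch; the field types and sizes here are illustrative assumptions, not the exact structure used in our implementation.

#include <stdint.h>

#define TAU 3

/* Rough layout of one hash-table entry, matching the estimate above:
   tau + 1 dense point IDs, the signature value to be matched, and the
   identifiers of the colliding dimensions (two on average). */
struct HashEntry {
    uint32_t points[TAU + 1];   /* 4 x 4 bytes = 16 bytes of point IDs       */
    uint64_t signature;         /* 8 bytes: sum of 14-digit integer keys     */
    uint64_t dimensions[2];     /* 2 x 8 bytes = 16 bytes for ~2 dimensions  */
};                              /* roughly 40 bytes per entry                */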
To overcome this huge memory requirement, we can split the signature computations twice. We used a split factor of 60 to bring down the total memory requirement for the hash tables. Each of these 60 slices was further split into 200 sub-slices to be run on multiple cores. The memory requirements for hTable are different for each slice. We found in our experiments that different numbers of signatures were generated in different slices.
As shown in Figure 5.8, the number of signatures seems to follow the familiar bell curve, and a relatively large number of signatures were generated in the middle slices. We investigated the values of the 14-digit keys, which were generated randomly and mapped to the 3661 points of the pedestrian dataset. The keys show no particular bias and their values lie randomly within the full range between 1.0E14 and 1.0E15 (Figure 5.9).

Figure 5.8: Dataset: pedestrian: 3661 × 6144. Parameters: ε = 0.000001, τ = 3, sp = 60. Number of signatures generated in each of the 60 slices. A large number of signatures are generated towards the middle slice numbers.
Instead of a user-defined number of slots for hTable, as we used for the madelon dataset, we divided the total number of signatures by the split factor to approximate the memory requirement for each slice. The size of hTable was calculated as totalSignatures/sp. We can see that the execution time decreases drastically with an increase in the number of threads.
We used the pedestrian dataset with the parameters ε = 1.0E−6, τ = 3, outerSplit = 60 and innerSplit = 200. The total number of expected signatures was 19860542724, and we divided this number by the outer and inner split factors to declare an hTable of size 1655045. It took around 26 hours to finish processing all 60 slices, with each slice being split into 200 sub-slices and processed in parallel with 48 threads. The sequential version of the SUBSCALE algorithm reportedly took ∼ 720 hours to process this data.
Figure 5.9: The distribution of values of the 3661 keys used for the pedestrian dataset. No two keys are the same and all are generated from the full space of the 14-digit integer domain. Keys are not generated in any particular ascending or descending order. These 3661 keys are mapped one-to-one to the 3661 points.
5.4.4 Summary
The SUBSCALE algorithm introduced in Chapters 3 and 4 can find non-trivial clusters in high-dimensional datasets. However, the time complexity of the SUBSCALE algorithm and its scalable version depends heavily on the computation of 1-dimensional dense units. To further reduce the computational complexity, parallelization is the only choice.
In this chapter, we have used widely available shared-memory multi-core architectures to parallelize the SUBSCALE algorithm. We have developed and implemented various approaches to compute the dense units in parallel. The results with up to 6144 dimensions have shown linear speedup. In future, we aim to utilise General Purpose Graphics Processing Units to further reduce the execution time of this algorithm.
Chapter 6
Outlier Detection
6.1 Introduction
With the evolution of information technology, increasingly detailed data is being captured from a wide range of data sources and mechanisms [3]. While additional details about the data increase the number of dimensions, the consolidation of data from different sources and processes widens the possibilities for introducing errors and inconsistencies [85]. In addition to the need for better data analysis tools, concerns about data quality have also grown tremendously [86, 87].
Real data is often called ‘dirty data’ or ‘bad data’ as it inevitably contains anomalies like wrong, invalid, missing or outdated information [88]. Anomalies are abnormal values in the data and are also known as outliers. Outliers can arise from an inadequate procedure of data measurement and collection, or from an inherent variability in the underlying data domain. The presence of outliers can have a disproportionate influence on data analysis [89].
Data analysis is a foundation of any decision-making process in a data-driven application domain. Poor decisions propelled by poor data quality can result in significant social and economic costs, including threats to national security [90–92]. In 2014, the US postal service lost $1.5 billion due to wrong postal addresses [93].
The widespread impact of poor quality data is also revealed by a recent report which says that 75% of companies waste an average of 14% of their revenue on bad data [94]. In critical areas like the health sector, poor data quality can lead to wrong conclusions and can have life-threatening consequences [95, 96]. Evidence-Based Medicine (EBM) is the process of using clinical research findings to aid clinical diagnosis and decision making by the clinician [97]. Although EBM is increasingly being used for clinical trials, the quality of patient outcomes depends upon the quality of data [98]. In addition to EBM, the quality of health care data also plays an important role in scheduling and planning hospital services [99].
Nonetheless, the quality of data depends upon the context in which it is produced or used [100]. The broader meaning of data quality has evolved from the term ‘fitness for use’ proposed in a quality control handbook by Juran [101]. Although efforts have been made to define data quality in terms of various characteristics like accuracy, relevance, timeliness, completeness, and consistency [102, 103], there is no single tool which can solve all of the data quality problems. In fact, the problem of ‘data quality’ is multi-faceted and usually requires domain knowledge and multiple quality improvement steps [104–106].
Data cleaning, also known as scrubbing, reconciliation or cleansing, is an inherent part of the data preprocessing used by data warehouses in order to improve data quality [5]. Maletic and Marcus [107] enumerated the steps for the data cleansing process, which include identifying the anomalous data points and applying appropriate corrections or purges to reduce such outliers. Domain experts usually intervene in the cleaning process because their knowledge is valuable in the identification and elimination of outliers [108]. Additionally, a significant portion of data cleaning work has to be done manually or by low-level programs that are difficult to write and maintain [85, 86]. Needless to say, data cleaning is a time-consuming and expensive process. According to Dasu and Johnson [85], 80% of the total time spent by a data analyst is on the data cleaning part alone.
The increase in high-dimensional data these days poses further challenges for data cleaning. The main reason is that outliers in high-dimensional data are not as obvious as in univariate or even low-dimensional data. The normal and abnormal data points exhibit shared behaviour among multiple dimensions. The problem is further exaggerated by the surprising behaviour of distance metrics in higher dimensions, known as the curse of dimensionality (also discussed in Chapter 1) [8, 11]. The state-of-the-art traditional methods do not work for outlier detection in high-dimensional data [37, 109, 110].
In this chapter, we focus on the data cleaning aspect through efficient identification of outliers in high-dimensional data. We also endeavour to characterise each outlier with a measure of outlierness, which can aid the analyst in making an informed decision about the outlier.
In the next section, we discuss the issues pertaining to outliers while cleaning high-dimensional data. We discuss related outlier detection methods in Section 6.3. In Section 6.4, we propose our approach to deal with high-dimensional data for outlier detection.
6.2 Outliers and data cleaning
The detection and correction of anomalous data is the most challenging problem within data cleaning. According to Hawkins [111], an outlier or an anomaly is an observation that deviates so much from the rest of the observations as to arouse suspicion about its origin. Quite often, outliers skew the data or bring another dimension of complexity into data models, making it difficult to accurately analyse the data. Outliers may be of interest for several other reasons too; apart from data cleaning, outlier detection has enormous applications in fraud detection, detection of criminal activities, gene expression analysis, and environmental surveillance.
There are different ways to handle univariate outliers. Statistical methods based on the Chebyshev theorem [112] are very common for data cleaning: points beyond a certain number of standard deviations are termed outliers using a confidence interval. However, univariate or even low-dimensional outliers are usually obvious and can be detected through visual inspection or using traditional approaches.
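A minimal sketch of this classical univariate approach is shown below; it flags values lying more than m standard deviations from the mean and is an illustration only, not part of the SUBSCALE pipeline.

#include <math.h>
#include <stddef.h>

/* Marks outlier[i] = 1 for every value lying more than m standard
   deviations away from the mean of the single attribute x[0..n-1]. */
void flag_univariate_outliers(const double *x, size_t n, double m, int *outlier)
{
    double mean = 0.0, var = 0.0;

    for (size_t i = 0; i < n; i++)
        mean += x[i];
    mean /= (double)n;

    for (size_t i = 0; i < n; i++)
        var += (x[i] - mean) * (x[i] - mean);
    double sd = sqrt(var / (double)n);

    for (size_t i = 0; i < n; i++)
        outlier[i] = (fabs(x[i] - mean) > m * sd) ? 1 : 0;
}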
Table 6.1: Outlier removal dilemma

Data points    d1     d2     d3
P1             11     50     60
P2             10     52     63
P3             12    250     62
P4            101     49     03
But for high-dimensional data (also called multivariate data), the outliers are hidden in the underlying subspaces. In Table 6.1, the data point P3 seems to have an abnormal value in dimension d2, which might represent, say, the age of a person. However, point P3 appears normal under the subspace {d1, d3}. Similarly, the data point P4 has abnormal values in dimensions d1 and d3 but appears to be normal in dimension d2. The outlierness of points P3 and P4 is still observable in this 3-dimensional dataset. But the detection of such outliers becomes very challenging for high-dimensional data as the number of possible subspaces becomes exponential with the increase in dimensions.
Moreover, analysts are frequently faced with the dilemma of what to do with an outlier. In many cases, the available information and knowledge is insufficient to determine the correct modification to be applied to the outlier data points. On the one hand, removal of outliers may greatly enhance the data quality for further analysis and can be a cheaper practical solution than fixing them. On the other hand, deletion of a data point Pi detected as an outlier can lead to a loss of information if Pi is not an outlier in all of the dimensions. This loss of information can be avoided by obtaining additional details about this point, for example, the number of subspaces in which this point shows outlying behaviour. The ranking of data points in the order of their outlierness also helps to focus on the important outliers and deal with them accordingly.
Both clustering and outlier detection are based on the notion of similarity among the data points. The clusters are the points lying in the dense regions while outliers are the points lying in the sparse regions of the data. As with clustering, the state-of-the-art traditional distance or density methods do not work for outlier detection in high-dimensional data [37, 109, 110]. These methods look for outliers using all of the dimensions simultaneously. But due to the curse of dimensionality, all data points appear to be equidistant from each other in the high-dimensional space. The notion of proximity fails in the sparse high-dimensional space and every point appears to be an outlier. Outliers are complex in high-dimensional data as the points are correlated differently under different subsets of dimensions.
Referring to Figure 1.1 from Chapter 1, a data point can be a part of a cluster in some of the subspaces while it can exist as an outlier in the rest of the subspaces. Due to the exponential growth in the number of subspaces with the increase in the dimensions of the data, finding outliers in all subspaces is a computationally challenging problem. There is an exigent need for efficient and scalable outlier detection algorithms for high-dimensional data [37, 113]. In this chapter we focus on the utility of outlier detection in data cleaning applications.
In addition to detecting outliers, it is important and useful to further characterize each outlier with a measure of its outlierness in the form of an outlier score. The outlier score can reveal the interestingness of an outlier to the data analyst. Most of the outlier detection algorithms work as a labelling mechanism and give a binary decision on whether a data point is an outlier or not [114]. Scoring and ranking the outliers can give a better understanding of the behaviour of outliers with respect to the rest of the data and can aid the data cleaning process.
In the previous chapters, we proposed algorithms to find clusters embedded in the subspaces of high-dimensional datasets. In this chapter, we utilize these algorithms to discover outliers embedded in the subspaces of the data. We also propose a further characterisation of these outliers through their outlying score. Before discussing our approach, we survey the current state of outlier detection research in the next section.
6.3 Current methods for outlier detection
There has been significant research work in the outlier detection area, as detailed in recent literature surveys [37, 110, 114]. Historically, the problem of outlier detection has been studied extensively in statistics, notably by Barnett and Lewis [112], by categorizing data points with low probability as outliers. However, this approach requires prior knowledge of the underlying distribution of the dataset, which is usually unknown for most large datasets. In order to overcome the limitations of the statistical approaches, distance- and density-based approaches were introduced [115, 116]. Still, most of the work in outlier detection deals with low-dimensional data only.
6.3.1 Full-dimensional based approaches
Knorr [115] suggested a distance-based approach in which objects with fewer than k neighbours within distance λ are outliers. A variant was proposed by Ramaswamy et al. [117], which takes the distance of an object to its k-th nearest neighbour as its outlier score and retrieves the top m objects having the highest outlier scores as the top m outliers. In the same year, Breunig et al. [118] proposed to rank outliers using the local outlier factor (LOF), which compares the density of each object of a dataset with the density of its k-nearest neighbours. A LOF value of approximately 1 indicates that the corresponding object is located within a cluster of homogeneous density. The higher the difference between the density around an object and the density around its k-nearest neighbours, the higher is the LOF value assigned to this object.
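As an illustration of the kNN-distance score of Ramaswamy et al., a brute-force sketch in the full-dimensional space is given below (assuming k < n and row-major data); its quadratic cost and its reliance on full-space distances are exactly the limitations discussed in this section.

#include <stdlib.h>
#include <math.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* score[i] = Euclidean distance from point i to its k-th nearest neighbour,
   computed over all d dimensions of the n x d array data. */
void knn_outlier_scores(const double *data, int n, int d, int k, double *score)
{
    double *dist = malloc((size_t)n * sizeof *dist);

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int a = 0; a < d; a++) {
                double diff = data[i * d + a] - data[j * d + a];
                s += diff * diff;
            }
            dist[j] = sqrt(s);
        }
        qsort(dist, (size_t)n, sizeof *dist, cmp_double);
        score[i] = dist[k];        /* dist[0] is the point itself */
    }
    free(dist);
}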
Later on, improvements over these outlier ranking schemes were proposed [119–121], but again they are based on the full-dimensional space and face the same data sparsity problem in higher dimensions. Most of the approaches proposed so far, which are explicitly or implicitly based on the assessment of differences in the Euclidean distance metric between objects in the full-dimensional space, do not work efficiently [122]. Some researchers [116, 123] have used depth-based approaches from computer graphics, where objects are organized in convex hull layers and outliers are expected to have shallow depth values. But these approaches too fail for high-dimensional data due to the inherent exponential complexity of computing convex hulls. Kriegel et al. [122] have used the variance of angles between pairs of data points to rank outliers in high-dimensional data. If the spectrum of observed angles for a point is broad, the point will be surrounded by other points in all possible directions, meaning the point is positioned inside a cluster; a small spectrum means other points will be positioned only in certain directions, indicating that the point is positioned outside of some sets of points that are grouped together. However, the method cannot detect outliers surrounded by other points in subspaces, and the naive implementation of the algorithm runs in O(n^3) time for a dataset of n points.
These traditional outlier ranking techniques, which use outlierness measures in the full space, are not appropriate for outliers hidden in subspaces. In the full space, all objects appear to be alike, so traditional outlier ranking can no longer distinguish the outlierness of objects. An object may show high deviation compared to its neighbourhood in one subspace but may cluster together with other objects in a second subspace, or might not show up as an outlier in a third, scattered subspace [124].
6.3.2 Subspace based approaches
The problem of outlier detection in subspaces has been mostly neglected by the research community so far. Although it is important to look into the subspaces for interesting and potentially useful outliers, the number of possible subspaces increases exponentially with an increase in the number of dimensions. However, some authors [125, 126] have contended that not all attributes or dimensions are relevant for detecting outlying observations.
Pruning the subspaces
The complexity of the exhaustive search for all subspaces is 2^k, where k is the data dimensionality. Our aim is to detect outliers in high-dimensional data by choosing the relevant subspaces and then pruning the objects so as to minimize the calculations for every object in the selected subspaces. One approach to deal with high-dimensional data is dimensionality reduction techniques like PCA (Principal Component Analysis), which map the original data space to a lower-dimensional data space. However, these methods may be inadequate for getting rid of irrelevant attributes because different objects show different kinds of abnormal patterns with respect to different dimensions. To reduce the search space, we rely on the downward closure property of density, enabling an Apriori-like search strategy. The subspaces can be pruned with respect to outliers based upon the following properties:
Property a. If an object is not an outlier in a k-dimensional subspace S, then it cannot be
an outlier in any subspace that is a subset of S.
Property b. If an object is an outlier in a k-dimensional subspace S, then it will be an
outlier in any subspace that is a superset of S.
Knorr and Ng [127] have proposed algorithms to identify outliers in subspaces instead of the full-attribute space of a given dataset. Their main objective was to provide some intentional knowledge of the outliers, that is, a description or an explanation of why an identified outlier is exceptional. For example, what is the smallest set of attributes that explains why an outlier is exceptional? Is this outlier dominated by other outliers? Aggarwal et al. [128] then proposed a grid-based subspace outlier detection approach; they used the sparsity coefficient of subspaces to detect outliers and used evolutionary computation as the subspace search strategy. Recent approaches have enhanced subspace outlier mining by using specialized heuristics for subspace selection and projection [109, 129, 130]. Muller and Schiffer [124] have approached this problem of subspace-based outlier detection and ranking by first pruning the subspaces which are uniformly distributed and then ranking each object in the remaining subspaces using kernel density estimation.
Although the recent work by Muller et al. [124] is a step towards subspace-based outlier detection and ranking, it has its own limitations; for example, they reject a few subspaces completely but then calculate the density for each and every object in the remaining subspaces. Our aim is to efficiently prune subspaces as well as objects in the remaining subspaces.
However, most existing approaches suffer from the difficulty of choosing meaningful
subspaces as well as the exponential time complexity in the data dimensionality.
6.4 Our approach
Our aim is to find outliers embedded in all possible subspaces of the high-dimensional
data and then to efficiently characterize them by measuring their outlierness. Technically, exploring the exponential number of subspaces of high-dimensional data to detect
relevant outliers is a non-trivial problem. The exhaustive search of the multi-dimensional
space-lattice is computationally very demanding and becomes infeasible when the dimensionality of data is high.
In the previous chapters, we have tackled the problem of subspace clustering and our
proposed SUBSCALE algorithm is quite efficient and scalable with the dimensions. It is
desirable that we utilise our already established technique to solve the problem of outlier
detection and ranking in high-dimensional data.
Our work is motivated by the following observations:
1. In a high-dimensional space, due to the curse of dimensionality, each data point is far away from every other point and thus it is difficult to find outliers using the full-dimensional space. However, data points show interesting correlations with each other in the underlying subspaces.
2. For k-dimensional data, there are 2^k − 1 subspaces to be searched for each data point, which is a computationally expensive task. So, efficient pruning of subspaces as well as of data points is needed. Most of the literature on subspace pruning is based on heuristic measures. But this random selection of subspaces is bound to generate random identification and ranking of outliers, giving poor results. We aim at developing efficient and meaningful measures rather than heuristic selections of outliers.
3. We have no prior information about the underlying data distribution or about the significant dimensions for detecting outliers. So, we focus on solving this problem using unsupervised density-based methods, especially subspace clustering. Clustering, also known as unsupervised learning, distinguishes dense areas with a high data concentration from sparse areas. As outliers have low density around them, we can explore these sparse areas.
4. Measuring the outlierness of an object is more important than just labelling it as an outlier or inlier. We aim at providing a better ranking of the outliers based on their behaviour in different subspaces. Keeping in view the use of outlier detection for data cleaning, we endeavour to aid the process of improving data quality through the outlier score of each data point.
5. We need to adapt our outlier detection technique according to the dimensionality of the subspaces. As the dimensionality increases, the density of nearest neighbours also decreases, so our algorithm should be able to adjust its parameters accordingly.
6.4.1 Anti-monotonicity of the data proximity
According to the downward closure of the dense regions proposed by Agrawal et al. [15]
in their Apriori search strategy, a data point from a dense cluster in a subspace S will
be a part of a dense region in all lower-dimensional projections of S. If we consider the
anti-monotonicity property of the same principle, we can infer that an object which is an
outlier in a subspace S will be an outlier in all higher-dimensional subspaces which are
supersets of the subspace S.
Figure 6.1: Outlier in trivial subspaces.

Consider distS(P1, P2) as a proximity distance function between two points P1 and P2 in a subspace S (the similarity measures were discussed in Chapter 1). If S1 and S2 are two different subspaces and subspace S2 is a superset of subspace S1, that is, S2 contains all the dimensions from subspace S1, then the following property holds for the downward closure of the search space for outliers:

distS2(P1, P2) ≥ distS1(P1, P2) ⟺ S2 ⊃ S1     (6.1)
Thus, the distance between two points will not become any shorter as we move from a
lower-dimensional subspace S1 to a higher-dimensional subspace S2 where S2 contains all
dimensions of S1 and some more. This is the reason anti-monotonicity holds for outliers
in multi-dimensional space.
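As a concrete illustration (assuming, for instance, the Euclidean distance as the proximity function, which is only one of the similarity measures discussed in Chapter 1), adding a single dimension d to a subspace S1 can never decrease the distance between two points:

\[
dist_{S_1 \cup \{d\}}(P_1, P_2)^2 \;=\; dist_{S_1}(P_1, P_2)^2 + (P_{1d} - P_{2d})^2 \;\geq\; dist_{S_1}(P_1, P_2)^2 .
\]

Applying this repeatedly for every extra dimension of S2 gives Equation 6.1.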
In the example given in Figure 6.1, a data point Pi first appears as an outlier in a subspace {1, 3} and is an outlier in all supersets of {1, 3} as shown by the shaded subspaces.
The shaded subspaces contain redundant information about the outlier and are trivial.
6.4.2 Minimal subspace of an outlier
We discussed in the previous subsection that if a point Pi is an outlier in a subspace S
then it is also an outlier in all higher-dimensional subspaces, S 0 where S 0 ⊃ S. Each
such subspace S 0 is known as a trivial subspace while S is a non-trivial subspace. If |S|
denotes the number of dimensions in a subspace S, then we define a minimal subspace
with respect to an outlier point Pi as follows:
Definition 7 (Minimal subspace of an outlier).
A subspace S^i_min is a minimal subspace with respect to an outlier point Pi if there exists no other subspace S' where Pi appears as an outlier and the dimensions in S' are a proper subset of the dimensions in S^i_min, 1 ≤ |S'| < |S^i_min|. Alternatively, S is a minimal subspace for a data point Pi if, for all lower-dimensional subsets of S, Pi appears as a part of some dense region.

A data point Pi can appear as an outlier in many subspaces which fulfil the condition for minimal subspaces for this point. Let us denote the set of minimal subspaces for a data point Pi as S^i_min. The subspaces in S^i_min can be either partially overlapping or non-overlapping with each other, but none of these subspaces is a complete subset of another.
In Figure 6.1, a point Pi can appear as an outlier for the first time in subspaces S1 =
{1, 3}, S2 = {1, 5} and S3 = {2, 3, 4}. All of these three subspaces are minimal for an
outlier point Pi . We notice that S1 ∪ S2 ∪ S3 = full dimensional space. Thus, no further
minimal subspace will exist for the data point Pi .
Detecting outliers in the minimal subspaces
Our interest in finding minimal subspaces of each outlier is based on the intuitive idea that
the cardinality of Smin gives an indication about the outlying behaviour of a data point.
Observation 4. If m is the number of dimensions in a minimal subspace Smin such that m = |Smin| and m ≤ k, then a smaller value of m means that Pi shows outlying behaviour in a larger number of subspaces. Typically, Pi will show outlying behaviour in all of the 2^(k−m) − 1 higher-dimensional subspaces.
The SUBSCALE algorithm detects the maximal subspace for each dense unit of points. If Smax is a maximal subspace for a set of dense units, it means that the points in these dense units will not appear together in any higher subspace which is a superset of Smax. However, the points in these dense units can appear as outliers or participate in other dense units with other points in higher-dimensional subspaces. The behaviour of the dense points from subspace Smax in its superset subspaces will depend entirely upon the underlying density distribution of these points in those subspaces.
For example, assume that two dense units Ua = {P1, P2, P3, P4} and Ub = {P1, P5, P6, P7} exist in dimensions d1 and d2. Along with these two, suppose the dense unit Ub also exists in dimension d3. Using the SUBSCALE algorithm, the dense unit Ua will be detected in the maximal subspace {d1, d2} while the dense unit Ub will be detected in the maximal subspace {d1, d2, d3}. Here, point P1 is part of another dense unit in the higher-dimensional subspace {d1, d2, d3}.
Starting with the 1-dimensional dense units, the points which do not participate in a
dense unit in a particular dimension are the 1-dimensional outliers. These outlier points
are easy to detect from 1-dimensional dense units. For example, if there are 7 points in
total out of which 2 points fail to participate in any of the 1-dimensional dense units in a
dimension dj , then dj is the minimal subspace for these 2 outlier points.
Some of the dense points from 1-dimensional dense units might not participate in any
of the 2-dimensional dense units. These will be outliers in the 2-dimensional space of the
data. These 2-dimensional outliers were not detected in single dimensions and appear for
the first time in the 2-dimensional subspaces. But, it is hard to detect these outliers from
the 2-dimensional maximal subspaces of the clusters given by the SUBSCALE algorithm.
As discussed in the previous example, from maximal subspace {d1 , d2 }, we only know
about dense unit Ua = {P1 , P2 , P3 , P4 } and cannot deduce that the remaining points
(DB − {P1 , P2 , P3 , P4 }) will be outliers in the subspace {d1 , d2 }. The reason is that some
of the remaining points might participate in other dense units, which also exist in additional single dimensions and would thus show up as higher-dimensional dense units, for example Ub in this case. Thus, given a maximal subspace cluster, it is difficult to find the outliers directly.
One solution is to take the projections of all maximal subspaces in their lower-dimensional subspaces (if they exist). In the previous example, since {d1, d2} is a projection of the subspace {d1, d2, d3}, we can concatenate the dense points from the dense units Ua and Ub. Thus, DB − {P1, P2, P3, P4, P5, P6, P7} are the outliers in the subspace {d1, d2}. But we cannot say that {d1, d2} is a minimal subspace for these outliers, because some of the outliers would have made their first appearance in lower-dimensional subspaces (single dimensions in this case). To reiterate, a subspace is a minimal subspace of an outlier if the data point appears for the first time as an outlier in this subspace and is not an outlier in any of the lower-dimensional projections of this subspace.
In each of the detected maximal subspaces S, the set of points which are not part of
the cluster in this subspace are the outliers. Let us denote this set of outlier points as
O. Some of these outliers will be old outliers O0 showing up from lower-dimensional
subspaces such that each such lower-dimensional subspace S 0 is a subset of S. Thus, the
subspace S will be a minimal subspace for the points in O − O0 .
Ranking outliers using minimal subspaces
The ranking decision for an outlier cannot be taken until all outliers have been discovered in all possible minimal subspaces. A score needs to be assigned to each outlier in each of its relevant minimal subspaces. The scores can be accumulated for each outlier to find its total score, which decides the rank of this outlier with respect to the other data points.
The number of subspaces in which an object shows outlying behaviour contributes to the strength of its outlyingness. The greater the number of subspaces in which an outlier appears, the more strongly it should be weighted. Thus, following Observation 4, an outlier which was first detected in lower-dimensional subspaces should have a bigger score than an outlier which was first detected in higher-dimensional subspaces. Another argument is that, due to the curse of dimensionality, the probability of data existing in clusters is higher in the lower-dimensional subspaces. Therefore, if a data point is not able to group together with other data points in these high-probability subspaces, it should be given a higher score as an outlier.
Based on Observation 4, we can use the number of dimensions in the minimal subspace of an outlier as a measure of the score. Let us assume that in addition to the minimal subspace S1 = {1, 3} (as shown in Figure 6.1), the point Pi also exists as an outlier in a minimal subspace S2 = {1, 5}. Since Pi is an outlier in S1, it will be an outlier in 2^(5−2) − 1 = 7 higher-dimensional superset subspaces. Similarly, with reference to subspace S2, Pi is again expected to be an outlier in 7 subspaces which are supersets of S2. We can assign a score of 7 + 7 = 14 to the point Pi for both subspaces S1 and S2. But there is a problem in this approach. There will be some common subspaces which contain all dimensions of S1 ∪ S2, for example, {1, 3, 5} and {1, 3, 4, 5}. In this particular example, there will be 2^(5−3) = 4 redundant subspaces. So the correct outlier score for the point Pi will be 14 − 4 = 10.
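The correction above is simply an inclusion–exclusion over the supersets of the two minimal subspaces. Restating the example with k = 5, |S1| = |S2| = 2 and |S1 ∪ S2| = 3:

\[
\underbrace{(2^{5-2}-1)}_{\text{supersets of } S_1} + \underbrace{(2^{5-2}-1)}_{\text{supersets of } S_2} - \underbrace{2^{5-3}}_{\text{subspaces containing } S_1 \cup S_2} = 7 + 7 - 4 = 10 .
\]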
It is possible that an outlier which was discovered in many higher-dimensional subspaces might end up having a total score higher than an outlier which was discovered in only a few lower-dimensional subspaces. The total score can only be found by applying corrections for the common subspaces between peer minimal subspaces (defined below), for example S1 and S2 in the above case.
For example, in Figure 6.1, S1 = {1, 2, 3} and S2 = {1, 2, 4} are peer subspaces as both are 3-dimensional subspaces with different sets of attributes. But there will be some common subspaces which contain all dimensions of S1 and S2.
Definition 8 (Peer subspaces).
We define peer subspaces as subspaces which have the same number of dimensions but have at least one dimension different from each other.
To calculate the total outlier score of each point, all of its peer minimal subspaces should first be found and then the correction for the common subspaces applied. This process involves matching every subspace from S^i_min with the other subspaces.
We notice that the process of ranking outliers using the minimal subspace theory is a two-step process. The first step is to find the minimal subspaces of each point by processing the old and new outliers as discussed in Section 6.4.2. The second step is to process the set S^i_min for each point Pi and find the total score of each outlier. Each data point will have its own set of minimal subspaces, where the number and sizes of these minimal subspaces will be different for each point. Thus, this approach seems to include the computational expense of, firstly, keeping track of old outliers while calculating minimal subspaces and, secondly, matching each and every subspace in the minimal subspace set of each point.
An alternative approach is to score the data points using the inliers, that is, the points which are not outliers. In the next subsection, we introduce the concept of the maximal subspace shadow of an outlier.
6.4.3 Maximal subspace shadow
A data point can appear as an outlier in some of the subspaces while it can appear as an inlier in other subspaces. An inlier means that the data point is part of some dense unit (or cluster) in those other subspaces. We define the maximal subspace shadow of a data point Pi as the maximal subspace S up to which it can survive without showing up as an outlier. Pi will cease to be part of a cluster in any subspace S' which is a superset of S. In a subspace hierarchy, the higher a data point can rise without being an outlier, the weaker it becomes as an outlier. The maximal subspace shadow S of an outlier point Pi is like a shadow of an outlier which was dense until subspace S; the shadow no longer exists in any superset of subspace S. It is important to note that, like minimal subspaces, a data point can have many maximal subspace shadows existing in different subspaces. Also, none of these shadow subspaces related to the same point will be a superset or a subset of another.
The maximal subspace shadow is easier to calculate than the minimal subspace for
each outlier. As we already have the SUBSCALE algorithm which directly finds the maximal subspace clusters, some of the data points in these clusters would never appear as a
part of some other cluster in the superset maximal subspaces. For each maximal subspace
S found by the SUBSCALE algorithm, we can iterate through each of the supersets S 0 of
the subspace S. Due to the Apriori principle, all dense points in the maximal subspace S 0
are also dense in the lower-dimensional subset subspace S. Once we remove those points
from S which also exist in S 0 , we have the set of points whose maximal subspace shadow
is S.
Once we have calculated the maximal subspace shadows of each data point, we assign scores to them. Algorithm 16 shows the steps to calculate the rank of all points using the SUBSCALE algorithm. We use the size of each of the detected maximal subspace shadows to assign an outlier score to a point. A higher score is assigned to a point whose maximal subspace shadow lies in a lower-dimensional subspace than to a point whose maximal subspace shadow lies in a higher-dimensional subspace. The number of dimensions in the maximal subspace shadow counts towards the scoring. For example, if a point P1 has a maximal subspace shadow S : {d1, d2}, then its score is increased by k − 2, where 2 is the dimensionality of S.
A k-dimensional dataset can have subspaces of dimensionality 1, 2, 3, 4, . . . , k. Since the number of subspaces is higher towards the start (C(k, 2), C(k, 3), . . .) and the end (C(k, k−2), C(k, k−3), . . .) of this dimensionality set, we normalise the score calculation for each maximal subspace shadow of dimensionality r by dividing the score by the binomial coefficient C(k, r), as in step 16 of Algorithm 16.
Also, if a data point exists as an outlier in a 1-dimensional subspace, that is, it is not present in any of the core sets created using the ε-neighbourhood, then it should be strongly scored. We penalise such points by adding 1 to the rank for each single dimension in which they appear as an outlier, but this penalty can be set to any other high value as well. These points are said to have a maximal subspace shadow of size zero.
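A small C sketch of this per-shadow contribution is given below; it mirrors the rank update of Algorithm 16, and the binomial helper and example values are illustrative assumptions rather than part of our implementation.

#include <stdio.h>

/* Binomial coefficient C(k, r), computed iteratively. */
static double binomial(int k, int r)
{
    double c = 1.0;
    for (int i = 1; i <= r; i++)
        c = c * (k - r + i) / i;
    return c;
}

/* Contribution of one maximal subspace shadow of dimensionality r in a
   k-dimensional dataset: (k - r) normalised by C(k, r). A shadow of size
   zero (a 1-dimensional outlier) is penalised with 1 per dimension, as
   described above. */
double shadow_score(int k, int r)
{
    if (r == 0)
        return 1.0;
    return (double)(k - r) / binomial(k, r);
}

int main(void)
{
    /* e.g. a 17-dimensional dataset (as in the shape experiments) and a
       maximal subspace shadow of dimensionality 2 */
    printf("%f\n", shadow_score(17, 2));
    return 0;
}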
Input: n: total number of data points; k: total number of dimensions; Clusters: set of clusters in their maximal subspaces.
Output: Rank: rank of each of the n points. The higher the score, the stronger a point is as an outlier.
/* Initialise ranks for all points to 0. */
1   for i ← 1 to n do
2       Rank[Pi] ← 0
3   end
4   for j ← 1 to k do
5       Find core-sets in dimension j using Algorithm 8 with density threshold parameters ε and τ
6       for each point Pi not participating in any of the core-sets in dimension j do
7           Rank[Pi] ← Rank[Pi] + 1
8       end
9   end
/* Each entry in Clusters is <P, S>, where P is a set of points which are dense in a subspace S. Clusters are found by using the SUBSCALE algorithm discussed in the previous chapters. */
10  for each entry <P, S> in Clusters (including all 1-dimensional clusters) do
11      X ← null
12      for each entry <P', S'> in Clusters where S' ⊃ S do
13          Append P' to X
14      end
15      P ← P − (P ∩ X)
16      for Pi ∈ P do
17          Rank[Pi] ← Rank[Pi] + (k − |subspace|) / C(k, |subspace|)   /* |subspace| is the number of dimensions in the subspace S */
18      end
19  end

Algorithm 16: Rank outliers: rank the outliers based on the SUBSCALE algorithm
6.5 Experiments
We experimented with four different datasets: shape (160 × 17), Breast Cancer Wisconsin (Diagnostic) (569 × 30), madelon (4400 × 500), and Parkinsons disease (195 × 22). The shape dataset is taken from the OpenSubspace project page [67] and the rest of the datasets are freely available at the UCI repository [69, 131].
We used the SUBSCALE algorithm to find clusters in all possible subspaces for each dataset. Then, we calculated the maximal subspace shadows of the data points using Algorithm 16. We evaluated the outlier scores of each data point with different ε-values. When we increase the ε-value, due to the increase in the neighbourhood radius, more points will be packed into the clusters. Thus, we expect the overall outlier scores to drop with a bigger value of the ε parameter. This is evident from the graphs shown below.
Figure 6.2 shows the outlier scores for the small 17-dimensional shape dataset with three ε-values: 0.01, 0.02, 0.03. The overall outlier score ranges between 0 and 17. A data point with a 0 outlier score means that it did not appear as an outlier in any of the subspaces and is part of some cluster in the higher k-dimensional subspace. It also implies that there will be at least τ more points with a 0 score.
Figure 6.3 shows the outlier ranking for the 22-dimensional Parkinsons disease dataset of 195 points. Each data point corresponds to a person and is classified in the original dataset as either diseased or healthy. Out of the total 195 points, 147 are diseased. We assume that the top 147 outliers should convey the information about the diseased data points. We calculated the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) for our results under three different settings. Table 6.2 displays these results, including the precision TP/(TP + FP), recall TP/(TP + FN) and fall-out FP/(FP + TN) rates. The outliers were predicted with more than 82% precision and recall.
Similar to the above dataset, we also experimented with the Breast Cancer (Diagnostic) dataset, which is bigger than the Parkinsons disease dataset, with 30 dimensions and 569 data points. There are 212 malignant and 357 benign data points as given in the original data description. Thus, we analysed the top 212 outliers for being malignant using three different ε parameters.
Chapter 6. Outlier Detection
20
ε=0.01
ε=0.02
ε=0.03
Outlier score
15
10
5
0
0
40
80
Data points
120
160
Figure 6.2: Outlier scores for shape dataset (160 data points in 17 dimensions). The
scores are evaluated with three different -values: 0.01, 0.02, 0.03.
Figure 6.3: Outlier scores for Parkinsons Disease dataset (195 data points in 22 dimensions). The scores are evaluated with three different ε-values: 0.001, 0.005, 0.01.
Table 6.2: Evaluation of Parkinsons disease dataset

ε-value   TP    FP   TN   FN   Precision(%)   Recall(%)   Fall-out(%)
0.001     116   31   17   31   78.9           78.9        64.6
0.005     121   26   22   26   82.3           82.3        54.2
0.03      106   41    7   41   72.1           72.1        85.4
Table 6.3: Evaluation of Breast Cancer dataset

ε-value   TP    FP   TN    FN   Precision(%)   Recall(%)   Fall-out(%)
0.001     160   52   305   52   75.5           75.5        14.6
0.005     173   39   318   39   81.6           81.6        10.9
0.01      136   76   281   76   64.2           64.2        21.3
The results are plotted in Figure 6.4. However, the fluctuations between outlier scores seem to be larger for ε = 0.005. The outliers detected from the Breast Cancer (Diagnostic) dataset were also evaluated to assess the performance of our algorithm, and the results are given in Table 6.3. The outliers were predicted with more than 81% precision and recall.
As with the other datasets, we can see a reduction in the overall outlier scores with a bigger epsilon (ε). The reason is that a large ε-value results in more points being packed into the clusters and, therefore, there are very few points left out as outliers. Since the scores are calculated for those data points which are not participating in the clusters, there will be a reduction in the score values with a larger ε. We also analysed the performance of our outlier ranking through precision, recall and fall-out, as done for the above dataset. The results seem better with ε = 0.005. We have used heuristics to decide on the epsilon value: a preliminary test was done on the data to choose a minimum starting epsilon which can generate clusters in at least one or two different subspaces.
Finally, we experimented with the 500-dimensional madelon dataset of 4400 points with ε values of 0.000001, 0.000005 and 0.00001. Similar trends between the ε-value and the overall ranking can be seen in this dataset as well.
Figure 6.4: Outlier scores for Breast Cancer (Diagnostic) dataset (569 data points in 30 dimensions). The scores are evaluated with three different ε-values: 0.001, 0.005, 0.01.
Figure 6.5: Outlier scores for madelon dataset (4400 data points in 500 dimensions). The scores are evaluated with three different ε-values: 0.000001, 0.000005, 0.00001.
All of our ranking computations for these four different datasets took between a few milliseconds and 5 minutes.
6.6 Summary
Poor quality data hampers the efficacy of data analysis and the decision making that follows it. Data is collected through automatic or manual processes. These processes can introduce errors, or the data itself may contain anomalies. Outliers are the data points showing anomalous behaviour compared with the rest of the data. Data cleaning, which ensures the quality of the data, is a laborious and expensive process. Considering the important role played by data quality in the credibility of decision making, we cannot escape data cleaning as a pre-processing step of data analysis.
In this chapter, we have presented an outlier detection and ranking algorithm for high-dimensional data. Our approach is highly scalable with the dimensions of the data and efficiently deals with the curse of dimensionality. Our algorithm also gives further insight into the behaviour of each outlier by giving additional details about its relevant subspaces and the degree of outlierness it exhibits. This outlier characterization is deemed important because it can help users to evaluate the identified outliers and understand the data better.
Chapter 7
Conclusion and future research directions
In this thesis, we have worked on the challenging problem of subspace clustering as well
as outlier detection and ranking in high-dimensional data. There has been a plethora
of research work on clustering in the last few years, but due to the exponential search
space with the increase in dimensions, analysing big datasets with high dimensions is a
computationally expensive task. As the discussion of all of this work is out of scope for
this thesis, we have highlighted some of the current and related work in Chapter 2.
We have proposed a novel algorithm called SUBSCALE, which is based on number theory and finds all possible subspace clusters in high-dimensional data without using expensive indexing structures or performing multiple data scans. The SUBSCALE algorithm directly computes the maximal subspace clusters and is scalable with both the size and the dimensionality of the data. The algorithm finds groups of similar data points in each single dimension based on ε-distance within the 1-dimensional projections of the data points. These one-dimensional similarity groups are broken into fixed-sized chunks called dense units. Each point in a dense unit is mapped to a unique key, and the sum of the keys of the points in a dense unit is called its signature. The collision of such signatures from all single dimensions results in the discovery of hidden clusters in the relevant multi-dimensional subspaces.
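The core mechanism can be illustrated with a small, self-contained sketch. The toy data, the random 64-bit keys and the fixed-size sliding window used as a dense unit below are illustrative assumptions rather than the thesis implementation (which assigns random large-integer keys and relies on number-theoretic arguments to make signature collisions reliable); the sketch only shows how summing point keys into signatures and colliding those signatures across dimensions exposes candidate multi-dimensional dense units.

```cpp
// Toy sketch of the key/signature/collision idea (assumptions: random 64-bit
// keys, a fixed-size sliding window as the dense unit, tiny hand-made data).
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <random>
#include <unordered_map>
#include <vector>

int main() {
    const int n = 8, dims = 3;
    const double eps = 0.5;   // epsilon-distance threshold in one dimension
    const int unitSize = 4;   // fixed size of a dense unit (chunk)

    // Points 0 to 3 are epsilon-close in every dimension; the rest are scattered.
    std::vector<std::vector<double>> data = {
        {1.0, 2.0, 3.0}, {1.1, 2.1, 3.1}, {1.2, 2.2, 3.2}, {1.3, 2.3, 3.3},
        {9.0, 1.0, 7.0}, {4.0, 8.0, 0.5}, {7.5, 5.5, 9.5}, {2.5, 6.5, 4.5}};

    // Assign a random large-integer key to every data point.
    std::mt19937_64 rng(42);
    std::vector<std::uint64_t> key(n);
    for (auto& k : key) k = rng();

    // signature -> dimensions in which a dense unit with that signature was found
    std::unordered_map<std::uint64_t, std::vector<int>> table;

    for (int d = 0; d < dims; ++d) {
        // Sort point ids by their 1-dimensional projection onto dimension d.
        std::vector<int> ids(n);
        for (int i = 0; i < n; ++i) ids[i] = i;
        std::sort(ids.begin(), ids.end(),
                  [&](int a, int b) { return data[a][d] < data[b][d]; });

        // A window of unitSize consecutive points whose spread is within eps
        // forms a dense unit; its signature is the sum of the point keys.
        for (int i = 0; i + unitSize <= n; ++i) {
            if (data[ids[i + unitSize - 1]][d] - data[ids[i]][d] > eps) continue;
            std::uint64_t sig = 0;
            for (int j = 0; j < unitSize; ++j) sig += key[ids[i + j]];
            table[sig].push_back(d);
        }
    }

    // A signature that collides across dimensions marks the same dense points
    // as a candidate cluster in that multi-dimensional subspace.
    for (const auto& [sig, ds] : table) {
        if (ds.size() < 2) continue;
        std::cout << "signature " << sig << " collides in dimensions:";
        for (int d : ds) std::cout << ' ' << d;
        std::cout << '\n';
    }
}
```

In this toy run, points 0 to 3 are ε-close in every dimension, so their common signature collides in all three dimensions and flags them as a candidate cluster in the full 3-dimensional subspace.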
Chapter 3 introduced the basic SUBSCALE algorithm, while its scalable version is presented in Chapter 4. We have experimented with numerical datasets of up to 6144 dimensions, and our proposed algorithm has been very efficient as well as effective in finding all possible hidden subspace clusters. The other state-of-the-art clustering algorithms have failed to perform on data of such high dimensionality. The work in Chapter 4 demonstrates that a combination of algorithmic enhancements to the SUBSCALE algorithm and distribution of the computations over a network of workstations allows a large dataset to be clustered in just a few minutes.
In Chapter 5, we have also presented a parallel version of the SUBSCALE algorithm to reduce the running time for bigger datasets. The linear speedup with up to 48 cores looks very promising. However, the shared-memory architecture of OpenMP remains a bottleneck due to the lock mechanism discussed in Chapter 5. A Message Passing Interface (MPI) based parallel model could be explored, processing each dimension or slice locally and using intermittent communication with the other nodes. Additionally, the computing power of General Purpose Graphics Processing Units (GPGPUs) could be harnessed by implementing the SUBSCALE algorithm with OpenCL or CUDA. However, the algorithm would need to be adapted to minimise the communication overhead while accessing the common hash table for collisions. The hash table could be managed centrally, or replicated and synchronised periodically among the nodes. Efficient parallel clustering techniques are very much needed for cluster analysis in large-scale data mining applications in the future.
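To make the lock bottleneck concrete, the following minimal OpenMP sketch serialises every insertion into the shared signature table inside a critical section; the per-dimension signature generation is replaced by a dummy stand-in, so the code is illustrative only and not the thesis implementation.

```cpp
// Minimal OpenMP sketch of the shared-hash-table bottleneck (compile with
// -fopenmp). The per-dimension signature generation is a dummy stand-in.
#include <cstdint>
#include <iostream>
#include <unordered_map>
#include <vector>

// Dummy stand-in: pretend each dimension yields a handful of dense-unit signatures.
std::vector<std::uint64_t> computeSignatures(int d) {
    return {static_cast<std::uint64_t>(1000 + d % 3),
            static_cast<std::uint64_t>(2000 + d % 5),
            3000ULL};
}

int main() {
    const int dims = 64;
    std::unordered_map<std::uint64_t, std::vector<int>> table;  // shared table

    #pragma omp parallel for schedule(dynamic)
    for (int d = 0; d < dims; ++d) {
        // Signature generation for each dimension runs fully in parallel ...
        std::vector<std::uint64_t> sigs = computeSignatures(d);

        // ... but every insertion into the shared table is serialised here,
        // which is the lock contention discussed in Chapter 5.
        #pragma omp critical(signature_table)
        for (std::uint64_t s : sigs) table[s].push_back(d);
    }

    std::cout << "distinct signatures: " << table.size() << '\n';
}
```

The per-dimension work scales with the number of cores, but the table updates do not; replicating the table per thread or per MPI node and merging it periodically, as suggested above, trades this contention for extra memory and communication.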
Also, it would be interesting to look into the details of the bell curve we came across in Chapter 5 (Figure 5.8). The pedestrian data of 3661 points in 6144 dimensions was sliced using sp = 60. This graph plotted the total number of signatures generated across all single dimensions within the LOW and HIGH range of each of the slices. The large-integer keys assigned to these 3661 data points are completely random, as plotted in Figure 5.9, so the values of the signatures generated from the combinations of dense points in each of the single dimensions are also expected to be random. However, we notice that the number of signature values lying between the LOW and HIGH range is highest near the middle slice number (30 in this case). In addition to exploring the reasons for this high turnout of signatures in the centre, the SUBSCALE algorithm could be optimised further by splitting the computations more near sp/2 rather than in the ranges [0, sp/C) or [sp/C, sp), where C is a constant which measures the number of cheaper computations towards the beginning or end of the number of slices.
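One possible way to act on this observation, sketched below under assumed values, is to partition the slices among workers by weight rather than by count, so that the signature-heavy slices near sp/2 are shared among more workers; the triangular weight profile and the worker count are assumptions standing in for the measured distribution of Figure 5.8.

```cpp
// Sketch of weight-based slice partitioning: heavier (middle) slices are shared
// among more workers. Assumptions: a triangular weight profile and 6 workers.
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    const int sp = 60;      // number of slices, as in the pedestrian experiment
    const int workers = 6;  // hypothetical number of parallel workers

    // Weight each slice by its closeness to the centre slice sp/2.
    std::vector<double> weight(sp);
    double total = 0.0;
    for (int s = 0; s < sp; ++s) {
        weight[s] = 1.0 + (sp / 2.0 - std::abs(s - sp / 2.0));
        total += weight[s];
    }

    // Hand out contiguous bands of slices with roughly equal total weight, so
    // each band near sp/2 contains fewer slices (i.e. more workers per slice).
    const double perWorker = total / workers;
    double acc = 0.0;
    int start = 0, w = 0;
    for (int s = 0; s < sp; ++s) {
        acc += weight[s];
        if (acc >= perWorker * (w + 1) || s == sp - 1) {
            std::cout << "worker " << w++ << " gets slices [" << start << ", " << s << "]\n";
            start = s + 1;
        }
    }
}
```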
Although we have worked with numerical data with no missing values, our work can be further extended to deal with data containing missing values. Using the closeness of data points in other subspaces, approximations can be made for the missing values in correlated subspaces or dimensions. The concept of similarity-based groups can also be extended to categorical data. While the similarity measure for numeric data is distance based, for categorical data the number of mismatches between data points, or the categories shared among the data points, can be used to find 1-dimensional similarity groups. These similarity groups can then be broken down into dense units whose collisions help find hidden subspace clusters.
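As a rough sketch of how such categorical similarity groups might be formed, the toy example below groups records that share a category within each attribute and also computes a Hamming-style mismatch count between full records; the data, the grouping rule and the helper names are illustrative assumptions, not part of the SUBSCALE implementation.

```cpp
// Toy sketch of categorical similarity groups (illustrative assumptions only).
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Toy categorical dataset: rows are data points, columns are attributes.
    std::vector<std::vector<std::string>> data = {
        {"red", "small", "round"},
        {"red", "large", "round"},
        {"blue", "small", "square"},
        {"red", "small", "square"}};
    const int dims = 3;

    // For each attribute, group the point ids that share a category; these
    // groups play the role of the 1-dimensional similarity groups.
    for (int d = 0; d < dims; ++d) {
        std::map<std::string, std::vector<int>> groups;
        for (int i = 0; i < static_cast<int>(data.size()); ++i)
            groups[data[i][d]].push_back(i);

        std::cout << "attribute " << d << ":\n";
        for (const auto& [category, ids] : groups) {
            std::cout << "  " << category << " ->";
            for (int id : ids) std::cout << ' ' << id;
            std::cout << '\n';
        }
    }

    // Hamming-style mismatch count between two full records; a threshold on
    // this count could drive grouping instead of exact category matches.
    auto mismatches = [&](int a, int b) {
        int m = 0;
        for (int d = 0; d < dims; ++d) m += (data[a][d] != data[b][d]);
        return m;
    };
    std::cout << "mismatches(0, 3) = " << mismatches(0, 3) << '\n';
}
```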
Finally, it would be interesting to see many real-world applications of the SUBSCALE algorithm, especially in microarray data, anomaly detection in cyberspace or financial transactions, and other high-dimensional datasets. The quality and significance of the discovered clusters and outliers can only be verified by domain experts. We have proposed the outlier ranking algorithm in Chapter 6. The outliers with the highest scores are the most significant ones and can help data analysts set their priorities while cleaning the data.
The three main contributions of this thesis are:
1. SUBSCALE: a fast and scalable algorithm to find clusters in the subspaces of high-dimensional data.
2. Variants of the SUBSCALE algorithm that further improve its performance and allow its computations to be spread across distributed or parallel environments for speed-up.
3. An algorithm to detect and rank outliers by their outlying behaviour in the subspaces of high-dimensional data.
We believe that with the novel algorithms presented in this thesis, we have been able to advance the challenging research field of high-dimensional data analysis. We endeavour to continue working on the future directions discussed in this chapter.
[131] M. A. Little, P. E. McSharry, S. J. Roberts, D. A. Costello, and I. M. Moroz, “Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection,” BioMedical Engineering OnLine, vol. 6, no. 1, pp. 1–19, June 2007.