Fast and Scalable Subspace Clustering of High Dimensional Data

by Amardeep Kaur

A thesis presented for the degree of Doctor of Philosophy
School of Computer Science and Software Engineering
The University of Western Australia
Crawley, WA 6009, Australia
2016

Dedicated to my late mother

Abstract

Due to the availability of sophisticated data acquisition technologies, increasingly detailed data is being captured through diverse sources. Such detailed data leads to a high number of dimensions. A dimension represents a feature or an attribute of a data point. There is an emergent need to find groups of similar data points, called 'clusters', hidden in these high-dimensional datasets. Most of the clustering algorithms that perform very well in low dimensions fail for high numbers of dimensions. In this thesis, we focus on designing efficient solutions for the clustering problem in high-dimensional data.

In addition to finding similarity groups in high-dimensional data, there is an increasing interest in finding dissimilar data points as well. Such dissimilar data points are called outliers. Outliers play an important role in the data cleaning process, which is used to improve the data quality. We aid the expensive data cleaning process by facilitating additional knowledge about the outliers in high-dimensional data. We find outliers in their relevant sets of dimensions and rank them by the strength of their outlying behaviour.

We first study the properties of high-dimensional data and identify the reasons for inefficiencies in the current clustering algorithms. We find that different combinations of data dimensions reveal different possible groupings among the data. These possible combinations or subsets of dimensions of the data are called subspaces. Each data point represents measurements of a phenomenon over many dimensions. A dataset can be better understood by clustering it in its relevant subspaces, and this process is called subspace clustering. There is a growing demand for efficient and scalable subspace clustering solutions in many application domains like biology, computer vision, astronomy and social networking. But the exponential growth in the number of subspaces with the data dimensions makes the whole process of subspace clustering computationally very expensive.

Some of the clustering algorithms look for a fixed number of clusters in pre-defined subspaces. Such algorithms diminish the whole idea of discovering previously unknown and hidden clusters. We cannot have prior information of the relevant subspaces or the number of clusters. The iterative process of combining lower-dimensional clusters into higher-dimensional clusters in a bottom-up fashion is a promising subspace clustering approach. However, the performance of existing subspace clustering algorithms based on this approach deteriorates with the increase in data dimensionality. Most of these algorithms require multiple database scans to generate an index structure for enumerating the data points in multiple subspaces. Also, a large number of redundant subspace clusters are generated, either implicitly or explicitly, during the clustering process. We present SUBSCALE, a novel and efficient clustering algorithm to find all hidden subspace clusters in high-dimensional data with minimal cost and optimal quality.
Unlike other bottom-up subspace clustering algorithms, our algorithm neither relies on the step-by-step iterative process of joining lower-dimensional candidate clusters nor selectively chooses any user-defined subspace. Our algorithm steers directly toward the higher-dimensional clusters from one-dimensional clusters, without the expensive process of joining each and every intermediate cluster. Our algorithm is based on a novel idea from number theory and effectively avoids the cumbersome enumeration of data points in multiple subspaces. Moreover, the SUBSCALE algorithm requires only k database scans for a k-dimensional dataset. Other salient features of the SUBSCALE algorithm are that it does not generate any redundant clusters and is much more scalable as well as faster than the existing state-of-the-art algorithms. Several relevant experiments were conducted to compare the performance of our algorithm with the state-of-the-art algorithms, and the results are promising.

Although the SUBSCALE algorithm scales very well with the dimensionality of the data, the one computational hurdle is the generation of one-dimensional candidate clusters. All of these one-dimensional clusters are required to be kept in the computer's working memory to be combined effectively. Because of this, random access memory requirements are expected to grow substantially for bigger datasets. Nonetheless, an important property of the SUBSCALE algorithm is that the process of computing each subspace cluster is independent of the others. This property helped us to improve the SUBSCALE algorithm so that it can process the data to find subspace clusters even with a limited working memory. The clustering computations can be split at any level of granularity so that one or more computation chunks can fit into the available working memory. The scalable SUBSCALE algorithm can also be distributed across multiple computer systems with smaller processing capabilities for faster results. The scalability was studied with up to 6144 dimensions, whereas recent subspace clustering algorithms break down at a few tens of dimensions.

To speed up the clustering process for high-dimensional data, we also propose a parallel version of the subspace clustering algorithm. The parallel SUBSCALE algorithm is based on a shared-memory architecture and exploits the computational independence in the structure of the SUBSCALE algorithm. We aim to leverage the computational power of widely available multi-core processors and improve the runtime performance of the SUBSCALE algorithm. We parallelized the SUBSCALE algorithm and first experimented with processing the candidate clusters from single dimensions in parallel. But in this implementation, there was an unavoidable requirement of mutually exclusive access to certain portions of the working memory, which created a bottleneck in the performance of the parallel algorithm. We modified the algorithm further to overcome this performance hindrance and sliced the computations in such a way that at any given time no two threads try to access the same block of memory. The experimental evaluation with up to 48 cores has shown linear speed-up.

Although the largely automatic collection of data has opened new frontiers for analysts to gain knowledge insights, it has also introduced wide sources of error in the data. Hence, the data quality problem is becoming increasingly exigent. The reliability of any data analysis depends upon the quality of the underlying data.
It is well known that data cleaning is a laborious and expensive process. Data cleaning involves detecting and removing the abnormal values called outliers. Outlier identification becomes harder as the data dimensionality increases. Similar to clusters, outliers show their anomalous behaviours in the locally relevant subspaces of the data, and because of the exponential search space of high-dimensional data, it is extremely challenging to detect outliers in all possible subspaces. Moreover, a data point existing as an outlier in one subspace can exist as a normal data point in another subspace. Therefore, it is important that when identifying an outlier, a characterisation of its outlierness is also given. These additional details can aid a data analyst to make important decisions about whether an outlier should be removed, fixed or left unchanged.

We propose an effective outlier detection algorithm for high-dimensional data as an extension of the SUBSCALE algorithm. We also provide an effective methodology to rank outliers by the strength of their outlying behaviour. Our outlier detection and ranking algorithm does not make any assumptions about the underlying data distribution and can adapt to different density parameter settings. We experimented with different datasets and the top-ranked outliers were predicted with more than 82% precision and recall. A lower or tighter density threshold reveals more data points as outliers, while a higher or looser density threshold allows more data points to be part of one or more clusters and therefore lowers the overall ranking. With our outlier detection and ranking algorithm, we aim to aid data analysts with a better characterisation of each outlier.

In this thesis, we endeavour to further the data mining research for high-dimensional datasets by proposing various efficient as well as effective techniques to detect and handle the similar and dissimilar data patterns.

Acknowledgements

The PhD journey has been a learning experience for me, on both the personal and professional fronts. I would like to thank some of the many people who have helped me in various ways to complete this thesis.

First and foremost, I would like to offer sincere gratitude to my principal supervisor Professor Amitava Datta for his patience, encouragement and overall support. My writing and research skills have considerably improved compared to where I stood at the start of this PhD, mainly because of his positive and non-judgemental criticism along with continuous guidance. Thank you for sharing your wealth of knowledge and giving me this great opportunity to learn. I am also grateful to my co-supervisor Associate Professor Chris McDonald for his help in proofreading and providing useful feedback. While assisting him in the university teaching activities, I learnt a lot by observing the thoughtfulness and sheer hard work he put in for his students.

I acknowledge the financial and overall support received from the Australian Government through the Endeavour Postgraduate Award. Their professional workshops and regular contact from the case managers have been invaluable. The supercomputing training by the Pawsey Supercomputing Centre was of immense help. I thank IBM SoftLayer for providing their server for research. I would also like to thank the anonymous reviewers whose comments and feedback helped me improve my publications and subsequent thesis work. I offer my gratitude to the peaceful and serene university campus situated on the spiritual Noongar land.
The Graduate Research School offered many informative workshops and seminars that supported me throughout my research journey. I am thankful for the technical and administrative support available through my School of Computer Science and Software Engineering. My heartfelt thanks to Dr. Anita Fourie from student support services for being a good listener and a life-affirming pillar during those spaces plagued by a mix of uncertainties.

The discussions with my lab colleagues Nasrin, Alvaro, Kwan, Mubashar and Noha have been both a learning and a memorable experience. Special thanks to Noha for her care and concern all this time. I am grateful for the lovely bunch of friends, especially Arshinder, Lakshmi, Feng and Darcy, for their love and support. Many thanks to Catherine, who was instrumental in the start of this journey. Also, to my lost friend Setu for believing in me more than I believed in myself. The biggest debt is to my adorable father, Jaswinder Singh Dua, whom I can never repay for his unconditional love. I am thankful to him for letting me have my wings and always standing by me, no matter what. Lastly, my taste-buds cannot escape without thanking Connoisseur's Cookies & Cream ice-cream, which was always there to fall back upon, whatever be the reason and the season.

Publications

1. Kaur, A. & Datta, A. A novel algorithm for fast and scalable subspace clustering of high-dimensional data. Journal of Big Data, 2(17), pp. 1-24, 2015.
2. Kaur, A. & Datta, A. SUBSCALE: Fast and scalable subspace clustering for high dimensional data. In: Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW), pp. 621-628, 2014.

Contribution to thesis

My contribution to the thesis was 85%. I developed and implemented the idea, designed the experiments, analysed the results and wrote the manuscript. My supervisor, Professor Amitava Datta, contributed to the underlying idea and played a pivotal role guiding and supervising throughout, from the initial conception to the final submission of this manuscript.

Contents

1 Introduction
  1.1 Curse of dimensionality
  1.2 Subspace clustering problem
    1.2.1 Apriori principle
  1.3 Motivating examples
  1.4 Thesis organisation
2 Literature Review
  2.1 Introduction
  2.2 Partitioning algorithms
    2.2.1 K-means and variants
    2.2.2 Projected clustering
  2.3 Non-partitioning algorithms
    2.3.1 Full-dimensional based algorithms
    2.3.2 Subspace clustering
  2.4 Desirable properties of subspace clustering
3 A novel fast subspace clustering algorithm
  3.1 Introduction
    3.1.1 Exponential search space
    3.1.2 Redundant clusters
    3.1.3 Pruning and redundancy
    3.1.4 Multiple database scans and inter-cluster comparisons
  3.2 Research design and methodology
    3.2.1 Definitions and problem
    3.2.2 Basic idea
    3.2.3 Assigning signatures to dense units
    3.2.4 Interleaved dense units
    3.2.5 Generation of combinatorial subsets
    3.2.6 SUBSCALE algorithm
    3.2.7 Removing redundant computation of dense units
  3.3 Results and discussion
    3.3.1 Methods
    3.3.2 Execution time and quality
    3.3.3 Determining the input parameters
  3.4 Summary
4 Scalable subspace clustering
  4.1 Background
  4.2 Memory bottleneck
  4.3 Collisions and the hash table
    4.3.1 Splitting hash computations
  4.4 Scalable SUBSCALE algorithm
  4.5 Experiments and analysis
  4.6 Summary
5 Parallelization
  5.1 Introduction
  5.2 Related work
  5.3 Parallel subspace clustering
    5.3.1 SUBSCALE algorithm
    5.3.2 Parallelization using OpenMP
  5.4 Results and Analysis
    5.4.1 Experimental setup
    5.4.2 Data Sets
    5.4.3 Speedup with multiple cores
    5.4.4 Summary
6 Outlier Detection
  6.1 Introduction
  6.2 Outliers and data cleaning
  6.3 Current methods for outlier detection
    6.3.1 Full-dimensional based approaches
    6.3.2 Subspace based approaches
  6.4 Our approach
    6.4.1 Anti-monotonicity of the data proximity
    6.4.2 Minimal subspace of an outlier
    6.4.3 Maximal subspace shadow
  6.5 Experiments
  6.6 Summary
7 Conclusion and future research directions

List of Figures

1.1 Clusters
1.2 Data grouping
1.3 Bottom-up clustering
2.1 Data partitioning
2.2 Core and border data points in DBSCAN
3.1 Bottom-up clustering
3.2 Projections of dense points
3.3 Projections of clusters
3.4 Matching dense units across dimensions
3.5 Numerical experiments for probability of collisions
3.6 Experiments with Erdos Lemma
3.7 Collisions among signatures
3.8 An example of sorted data points in a single dimension
3.9 An example of overlapping between consecutive core-sets of dense data points
3.10 An example of using pivot to remove redundant computations of dense units from the core-sets
3.11 Effect of ε on runtime
3.12 ε vs F1 measure
3.13 Runtime comparison for similar quality of clusters
3.14 Runtime comparison for different quality of clusters
3.15 Runtime comparison between different subspace clustering algorithms for fixed data size
3.16 Runtime comparison between different subspace clustering algorithms for fixed dimensionality
3.17 Number of subspaces found vs runtime
4.1 Number of clusters vs size of the dataset
4.2 Data sparsity with increase in the number of dimensions
4.3 Internal structure of a signature node
4.4 Signature collisions in a hash table
4.5 Illustration of splitting hTable computations
4.6 Runtime vs split factor for madelon dataset
5.1 Projections of dense points
5.2 Structure of signature node
5.3 hTable data structure
5.4 Allocating separate thread to each dimension
5.5 Multiple threads for dimensions
5.6 Multiple threads for slices
5.7 Speedup
5.8 Bell curve of signatures generated in each slice
5.9 Distribution of values in keys
6.1 Outlier in trivial subspace
6.2 Outlier scores for shape dataset
6.3 Outlier scores for Parkinsons Disease dataset
6.4 Outlier scores for Breast Cancer (Diagnostic) dataset
6.5 Outlier scores for madelon dataset

List of Tables

1.1 Data matrix
3.1 Marks dataset
3.2 Clusters in the Marks dataset
3.3 List of datasets used for evaluation
4.1 Number of subspaces with increase in dimensions
6.1 Outlier removal dilemma
6.2 Evaluation of Parkinsons disease dataset
6.3 Evaluation of Breast Cancer dataset

Chapter 1
Introduction

With recent technological advancements, high-dimensional data are being captured in almost every conceivable area, ranging from astronomy to biological sciences. Thousands of microarray data repositories have been created for gene expression investigation [1]; sophisticated cameras are becoming ubiquitous, generating a huge amount of visual data for surveillance; the Square Kilometre Array Telescope is being built for astrophysics research and is expected to generate several petabytes of astronomical data every hour [2]. All of these datasets have more than hundreds or thousands of dimensions, and the number of dimensions is increasing with better data capturing technologies day by day. The dimensions of a dataset are also known as its attributes or features. Such dimensionally rich data poses significant research challenges for the data mining community [3, 4].

Clustering is one of the important data mining tasks to explore and gain useful information from the data [5]. Very often, it is desirable to identify natural structures of similar data points, for example, customers with similar purchasing behaviour, genes with similar expression profiles, stars or galaxies with similar properties. Clustering can also be seen as an extension of the basic human nature to identify and categorize the things around us. Clustering is an unsupervised process to discover these hidden structures or groups, called clusters, based on similarity criteria and without any prior information of the underlying data distribution.

Figure 1.1: Clusters.

Figure 1.1 is a pictorial representation of grouping two-dimensional points into clusters. We notice that some of these points do not participate in any of the clusters. To illustrate the clustering process in brief, consider an n × k dataset DB of k dimensions such that each data point P_i is measured as a k-dimensional vector (P_i^1, P_i^2, ..., P_i^k), where P_i^d, 1 ≤ d ≤ k, is the value of the data point P_i in the d-th dimension. We assume the data is in a metric space (Table 1.1). A cluster C is a set of points which are similar based on a similarity threshold. Thus, points P_i and P_j participate in the same cluster if sim(P_i, P_j) = true.

Table 1.1: Data matrix

         d_1        d_2        ...   d_{k-1}        d_k
P_1      P_1^1      P_1^2      ...   P_1^{k-1}      P_1^k
P_2      P_2^1      P_2^2      ...   P_2^{k-1}      P_2^k
...      ...        ...        ...   ...            ...
P_{n-1}  P_{n-1}^1  P_{n-1}^2  ...   P_{n-1}^{k-1}  P_{n-1}^k
P_n      P_n^1      P_n^2      ...   P_n^{k-1}      P_n^k

Similarity measure

A variety of distance measures can be used to quantify the similarity of the data points [6–8].
Distance is one of the commonly used measures of similarity in metric data. The shorter the distance between two data points, the more similar they are. The L_p-norm calculates the distance between two k-dimensional points P_i and P_j by comparing the values of their k dimensions (also called features), cf. Equation 1.1.

    distance(P_i, P_j) = L_p(P_i, P_j) = \sqrt[p]{\sum_{d=1}^{k} |P_i^d - P_j^d|^p}    (1.1)

L_1 and L_2 are two important forms of the L_p norm widely used in clustering, cf. Equations 1.2 and 1.3 respectively. L_1 is also called the City block distance or Manhattan distance, and L_2 is called the Euclidean distance.

    L_1(P_i, P_j) = \sum_{d=1}^{k} |P_i^d - P_j^d|    (1.2)

    L_2(P_i, P_j) = \sqrt{\sum_{d=1}^{k} (P_i^d - P_j^d)^2}    (1.3)

Most of the clustering algorithms generate clusters by measuring proximity between the data points through the L_p distance, using either all or a subset of dimensions [9, 10]. Two points P_i and P_j belong to the same cluster if L_p(P_i, P_j) ≤ threshold. The proximity threshold is decided by the user along with the density criterion. The density parameter tells how many points should lie within a close neighbourhood in a data space so that this region can be called a cluster. However, as the number of dimensions increases, the distance/density measurements fail to detect meaningful clusters due to a phenomenon called the Curse of dimensionality, which is discussed below.

1.1 Curse of dimensionality

Clustering high-dimensional data is difficult due to unique constraints imposed by the large number of dimensions, known as the Curse of dimensionality, a term coined by Richard Bellman [11]. The curse of dimensionality has two implications: the first concerns the similarity measure and the second concerns irrelevant attributes. According to Beyer et al. [12], as the dimensionality of data grows, data points tend to become equally distant from each other and thus the relative contrast between similar and dissimilar points decreases.

Figure 1.2: Data group together differently under different subsets of dimensions.

The second implication is the presence of irrelevant dimensions in high-dimensional datasets. Data tend to group together differently under different subsets of dimensions (attributes), and not all dimensions are relevant together at a time. For example, a 4-dimensional dataset can be projected onto a 2-dimensional space in six different ways. Figure 1.2 shows possible relationships between two different points P_1 and P_2 under different subsets of dimensions. We notice that only one of the points P_1 and P_2 participates in a cluster formation when projected on dimensions {d_1, d_2} and {d_2, d_3}, while both of them are part of either the same or different clusters in dimensions {d_1, d_4} and {d_2, d_4}. There is no cluster formation in dimensions {d_3, d_4}, and both points stay out of the cluster in dimensions {d_1, d_3}. To identify each of these relationships, we need to find clusters with respect to particular relevant sets of dimensions. As a subset of dimensions is called a subspace, these clusters existing in the subspaces of the data are called subspace clusters. The data points in a subspace cluster are similar to each other in all dimensions attached to this subspace.

Both of the above concerns of high-dimensional data imply that useful clusters can only be found in lower-dimensional subspaces and that all possible subspace clusters should be discovered.
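As an illustration of the two effects just discussed, the following sketch (not part of the thesis; it assumes NumPy and uniformly random data, and the function names are hypothetical) computes the L_p distance of Equations 1.1–1.3 and shows how the relative contrast (max − min)/min of distances shrinks as the dimensionality k grows, in line with the observation of Beyer et al. [12].

```python
import numpy as np

def lp_distance(p_i, p_j, p=2):
    """L_p distance between two k-dimensional points (Equations 1.1-1.3)."""
    diff = np.abs(np.asarray(p_i, dtype=float) - np.asarray(p_j, dtype=float))
    return float(np.sum(diff ** p) ** (1.0 / p))

def relative_contrast(n_points=1000, k=2, seed=0):
    """(max - min) / min of the L_2 distances from the origin for
    uniformly random points in [0, 1]^k; this shrinks as k grows."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, k))
    dists = np.linalg.norm(data, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

if __name__ == "__main__":
    print(lp_distance([10, 8], [9.6, 7.6], p=1))    # Manhattan (L_1) distance, ~0.8
    print(lp_distance([10, 8], [9.6, 7.6], p=2))    # Euclidean (L_2) distance
    for k in (2, 10, 100, 1000):
        print(k, round(relative_contrast(k=k), 3))  # contrast drops as k grows
```

For uniformly random data, the printed contrast drops sharply between k = 2 and k = 1000, which is exactly why full-dimensional distance thresholds stop being discriminative in high dimensions.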
1.2 Subspace clustering problem

Subspace clustering is a branch of clustering which endeavours to find all hidden subspace clusters. There is also an allied branch of clustering algorithms called projected clustering, where a user prescribes the number of subspace clusters to be found and each data point can belong to at most one cluster [13]. But this is more of a data partitioning approach than an exhaustive search for hidden subspace clusters.

An important property of subspace clustering is that we do not have prior information about the data points and dimensions participating in it. Thus, the only possible approach is to perform an exhaustive search for similar data points in all possible subspaces. Moreover, the number of hidden clusters and the relevant subspaces should be an output rather than an input of a clustering algorithm. A k-dimensional dataset can have up to 2^k − 1 axis-parallel subspaces. The number of subspaces is exponential in the dimensionality, e.g. there are 1023 subspaces for a 10-dimensional dataset and about 1.05 million for a 20-dimensional dataset. The large number of dimensions thus dramatically increases the possibilities of grouping data points. Thus, the number of subspace clusters can far exceed the data size. This exponential search makes subspace clustering a complex and challenging task. Most of the subspace clustering algorithms use a bottom-up search strategy based on the Apriori principle [14], which also helps to prune the redundant clusters.

1.2.1 Apriori principle

According to the Apriori principle, if a group of points forms a cluster C in a d-dimensional space, then C is also a part of some cluster in any lower (d − 1)-dimensional projection of this space. The downward closure property of this principle implies that the cluster C will be redundantly present in all 2^d − 1 projections of this d-dimensional space. We call this cluster C a maximal cluster, which is intuitively a cluster in a subspace of maximum possible dimensionality; it also means that this cluster ceases to exist if we increase the dimensionality of the subspace even by one. It is not necessary to detect the non-maximal clusters, because they can be detected anyway as projections of maximal clusters. However, most algorithms implicitly or explicitly compute these trivial clusters during the clustering process.

The second problem, of excessive database scans, arises as most algorithms construct clusters from dense units, smaller clusters that are occupied by a sufficient number of points. The database scans are required for determining the occupancy of the dense units while constructing subspace clusters bottom up: to check whether the same points occupy the next higher-dimensional dense unit while progressing from a lower-dimensional dense unit.

Subspace clustering is a very complex and challenging task for high-dimensional data, as the number of subspaces is exponential in the dimensions. Most of the subspace clustering algorithms use a bottom-up approach based on the downward closure property of the Apriori principle [15]. In this approach, density-based similarity measures are used to find the clusters in the lower-dimensional subspaces, starting from 1-dimensional clusters, which are combined together iteratively to form the clusters in the higher-dimensional subspaces (Figure 1.3).

Figure 1.3: Bottom-up clustering. Lower-dimensional clusters are joined with each other to obtain higher-dimensional clusters.

Although these algorithms can find arbitrary-shaped subspace clusters, they fail to scale with the dimensions. The speed as well as the quality of clustering is of major concern [16].
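To make the size of this search space and the downward-closure test concrete, here is a small illustrative sketch (not taken from the thesis; the helper names are hypothetical). It enumerates the 2^k − 1 axis-parallel subspaces for small k and checks whether a subspace is a valid bottom-up candidate, i.e. whether every one of its (d − 1)-dimensional projections already contains a cluster.

```python
from itertools import combinations

def axis_parallel_subspaces(k):
    """Enumerate all non-empty subsets of the k dimensions: 2^k - 1 subspaces."""
    dims = range(1, k + 1)
    for size in range(1, k + 1):
        yield from combinations(dims, size)

def is_candidate(subspace, clustered_subspaces):
    """Downward closure: a d-dimensional subspace can only hold a cluster if
    every one of its (d-1)-dimensional projections already holds one."""
    return all(
        tuple(dim for dim in subspace if dim != d) in clustered_subspaces
        for d in subspace
    )

if __name__ == "__main__":
    for k in (3, 10, 20):
        print(k, sum(1 for _ in axis_parallel_subspaces(k)))   # 7, 1023, 1048575
    # {1, 2, 3} is a candidate only if {1, 2}, {1, 3} and {2, 3} hold clusters
    print(is_candidate((1, 2, 3), {(1, 2), (1, 3), (2, 3)}))   # True
    print(is_candidate((1, 2, 3), {(1, 2), (1, 3)}))           # False
```

Even this toy enumeration makes the problem obvious: the count jumps from 7 subspaces at k = 3 to 1,048,575 at k = 20, which is why pruning alone cannot rescue an approach that materialises candidates in every intermediate subspace.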
1.3 Motivating examples

With the emergence of new applications, the area of subspace clustering is of critical importance. Following are some examples which cannot be handled by the traditional clustering algorithms due to their size, dimensionality and focus of interest:

– In biology, high-throughput gene expression data obtained from microarray chips forms a matrix [17]. Each cell in this matrix contains the expression level of a gene (row) under an experimental condition (column). The genes which co-express together under subsets of experimental conditions are likely to be functionally similar [18]. One of the interesting characteristics of this data is that both genes (rows) and experimental conditions (columns) can be clustered for meaningful biological inferences. One of the assumptions in molecular biology is that only a subset of genes is expressed under a subset of experimental conditions, for a particular cellular process [19]. Also, a gene or an experimental condition can participate in more than one cellular process, allowing the existence of overlapping gene-clusters in different subspaces. Cheng and Church [20] were the first to introduce biclustering, which is an extension of subspace clustering, for microarray datasets. Since then, many subspace clustering algorithms have been designed for understanding cellular processes [21] and gene regulatory networks [22], assisting in disease diagnosis [23] and thus better medical treatments. Eren et al. [24] have recently compared the performance of related subspace clustering algorithms on microarray data, and because of the combinatorial nature of the solution space, subspace clustering is still a challenge in this domain.

– Many computer vision problems are associated with matching images, scenes, or motion dynamics in video sequences. Image data is very high-dimensional, e.g., a low-end 3.1 megapixel camera can capture a 2048 × 1536 image of 3,145,728 dimensions. It has been shown that the solutions to these high-dimensional computer vision problems lie in finding the structures of interest in the lower-dimensional subspaces [25–27]. As a result, subspace clustering is very important in many computer vision and image processing problems, e.g., recognition of faces and moving objects. Face recognition is a challenging area, as the images of the same object may look entirely different under different illumination conditions and different images can look the same under different illumination settings. However, Basri et al. [25] have proved that all possible illumination conditions can be well approximated by a 9-dimensional linear subspace, which has further directed the use of subspace clustering in this area [28, 29]. Motion segmentation involves segregating each of the moving objects in a video sequence and is very important for robotics, video surveillance, action recognition, etc. Assuming each moving object has its own trajectory in the video, the motion segmentation problem reduces to clustering the trajectories of each of the objects [30], another subspace clustering problem.

– In online social networks, the detection of communities having similar interests can aid both sociologists and target marketers [26]. Günnemann et al. [31] have applied subspace clustering on social network graphs for community detection.
– In radio astronomy, clusters of galaxies can help cosmologists trace the mass distribution of the universe and further understand theories of the origin of the universe [32, 33].

– Another important area of subspace clustering is web text mining through document clustering. There are billions of digital documents available today, and each document is a collection of many words or phrases, making it a high-dimensional application domain. Document clustering is very important these days for efficient indexing, storage and retrieval of digital content. Documents can group together differently under different sets of words. An iterative subspace clustering algorithm for text mining has been proposed by Li et al. [34].

In all of the applications discussed above, meaningful knowledge is hidden in lower-dimensional subspaces of the data, which can only be explored through subspace clustering techniques. In this thesis, we look into this research challenge of finding subspace clusters in high-dimensional data and propose efficient algorithms which are faster and scalable in dimensions.

1.4 Thesis organisation

The thesis contains 7 chapters. Chapter 2 surveys background material on the problem of clustering, presenting several existing approaches to cluster data. It discusses the clustering techniques used to tackle high dimensionality, starting from trying to reduce the dimensionality, to partitioning, to subspace clustering. Chapter 3 explains the foundation of our approach, termed SUBSCALE, our novel algorithm for subspace clustering of high-dimensional data. Chapter 4 introduces the approaches to make SUBSCALE a scalable algorithm for bigger datasets, both in terms of size and dimensions. Chapter 5 discusses the parallel approaches to subspace clustering for faster execution. Chapter 6 illustrates the applications of SUBSCALE in outlier characterisation and ranking for high-dimensional data. It also presents a case study of using SUBSCALE on a genes dataset. Chapter 7 concludes the thesis and presents directions for future research.

Chapter 2
Literature Review

2.1 Introduction

In this chapter, we present the literature related to clustering, in particular subspace clustering. We focus more on the algorithms related to our solution and discuss their advantages as well as disadvantages. We also discuss the opportunities provided by parallel processing to increase the efficiency of clustering algorithms.

One of the fundamental endeavours to explore and understand the data is to find those data points which are either similar or dissimilar. Classification and cluster analysis fall into the category of similarity-based grouping of data, while outlier detection targets the dissimilar points. Classification is a supervised approach to group the data into already known classes or groups. Using a learning algorithm, predictions are made about which data point fits into which class. A recent survey on the state-of-the-art classification algorithms is presented in [35]. Clustering or cluster analysis is an unsupervised way of grouping similar data without any prior information about these groups [36]. Although clustering is more challenging than classification, it helps to discover the hidden clusters which cannot be known otherwise. The by-products of clustering are called outliers, as these are the data points which do not fit into any group and can provide further insights into the underlying data [37].
The history of cluster analysis can be traced back to the 1950s, when the popular K-means clustering algorithm was developed [38, 39]. The clustering problem has been studied extensively in different disciplines, including statistics [40], machine learning [41], image processing [26], bioinformatics [42] and data mining [5]. In fact, a search with the keyword 'Data clustering' on Google Scholar [43] found ∼3 million entries in the year 2016. There are a number of surveys available on clustering algorithms along the timeline of their development [44–51].

Clustering algorithms can be broadly divided into two categories: partitioning (section 2.2) and non-partitioning (section 2.3). The partitioning algorithms like K-means [38], K-medoids and PROCLUS [13] divide the n data points into K clusters using some greedy approach to optimize the convergence criteria, while the non-partitioning algorithms like DBSCAN [10] and CLIQUE [15] attempt to find all possible clusters without any predefined number of clusters. While clustering, these algorithms use either all of the dimensions together [10] or use the measurements in some [13] or all [15] of the subsets of dimensions.

2.2 Partitioning algorithms

Partitioning algorithms iteratively relocate the data points from one cluster to another until a convergence criterion is met. These are more of a data relocation technique to divide the data into a fixed number of non-overlapping regions (Figure 2.1).

Figure 2.1: Data partitioning (original data and partitioned data).

2.2.1 K-means and variants

K-means is one of the oldest clustering algorithms to partition the n data points into K non-overlapping clusters [38]. The K cluster centroids are initially selected at random or using some heuristics. The data points are assigned to their nearest centroids using Euclidean distance. The algorithm then recomputes the centroids of the newer distribution of groups, where a centroid is the mean of all the points belonging to that cluster. The data points are iteratively relocated until the algorithm converges. An objective function such as the minimum sum of squared errors is commonly used as the convergence criterion for the K-means algorithm. The sum of squared errors over all K clusters, where C_i is the i-th cluster with µ_i as its centroid, is:

    \sum_{i=1}^{K} \sum_{P_j \in C_i} \|P_j - \mu_i\|^2    (2.1)

The complexity of the K-means algorithm is O(nkKT), where n is the size of the data, k is the number of dimensions, K is the number of clusters and T is the number of iterations. Although K-means is very popular because of its simplicity and fast convergence, this algorithm is very sensitive to outliers, as they can skew the location of the centroids. Other limitations include the selection of the parameter K and the initial centroids, entrapment in local optima, and the inability to deal with clusters of arbitrary shape and size.

There have been many extensions to the K-means algorithm [39, 52]. For example, the K-medoid or partitioning around medoids (PAM) algorithm [53] uses the median of the data instead of their mean as the centres of the clusters. As the median is less influenced by extreme values than the mean, PAM is more resilient in the presence of outliers. But other limitations remain. The CLARANS algorithm [54] is an improvement over the K-medoid algorithm and is more effective for large datasets. Random samples of neighbours are taken from the data, and graph-search methods are used to iteratively obtain optimal K-medoids.
However, the quadratic runtime of the CLARANS algorithm is prohibitive on large datasets. For high-dimensional data, K-means and its variants are unable to find clusters in the subspaces.

2.2.2 Projected clustering

PROCLUS (PROjected CLUStering) [13] is a top-down projected clustering algorithm to find K non-overlapping clusters, each represented by an associated medoid and subspace. The value of K and the average subspace size are given by the user. The PROCLUS algorithm initially chooses a set of K potential medoids at random from a sample of points. The iterative phase includes finding K good medoids, each associated with its subspace. The subspace for each of these K medoids is determined by minimizing the standard deviation of the distances of the points in the neighbourhood of the medoids to the corresponding medoid along each dimension. The points are reassigned to the medoids considering the closest distance in the relevant subspace of each medoid. Also, the points which are too far away from the medoids are removed as outliers. The output is a set of partitions along with the outliers.

However, the user has to specify the number of clusters (K) as well as the number of subspaces. If the value of K is too small, then the PROCLUS algorithm may miss some of the clusters entirely. Also, the PROCLUS algorithm can find clusters in different subspaces, but only of the same subspace size, which can miss clusters in other subspaces. Additionally, the PROCLUS algorithm is biased toward clusters that are hyper-spherical in shape.

The ORCLUS (ORiented projected CLUSter generation) [55] algorithm is similar to the PROCLUS algorithm, except that it finds clusters in non-axis-parallel subspaces by selecting principal components for each cluster instead of dimensions. The FINDIT algorithm [56] is a variant of the PROCLUS algorithm and improves its efficiency and cluster quality using additional heuristics.

None of these projected clustering algorithms discovers all possible clusters in the data. Different groups of data can exhibit different clustering tendencies under different subsets of dimensions. Rather than being subspace clustering algorithms, these are essentially space-partitioning algorithms. Any attempt to choose the subspaces or their size beforehand nullifies the idea of finding all possible unknown correlations among the data.

2.3 Non-partitioning algorithms

The non-partitioning clustering algorithms do not depend on the user to input the number of clusters and the relevant subspaces (if any). The aim of a clustering algorithm is to explore and identify the previously unknown clusters among the data without knowing the underlying structure. Any attempt to pre-determine the number of clusters or subspaces before the actual clustering process would dilute the whole idea of clustering. The non-partitioning algorithms help to identify all possible hidden clusters in the data without any user bias about the number of clusters or subspaces. These algorithms are largely based on density measures of the data and play a pivotal role in finding arbitrary-shaped clusters. The clusters are the dense regions separated by sparse regions, or regions of low density. There are two main categories of such algorithms: one is based on full-dimensional similarity measures and the other measures similarity among data points using relevant subsets of dimensions.
2.3.1 Full-dimensional based algorithms

DBSCAN

DBSCAN [10] is a full-dimensional clustering algorithm and does not need prior information about the number of clusters. According to the DBSCAN algorithm, a point is dense if it has τ or more points within a distance ε. A cluster is defined as a set of such dense points with intersecting neighbourhoods.

Figure 2.2: Core and border data points in DBSCAN.

The clustering process is based on the following five definitions:

Definition 1 (ε-neighbourhood). Given a database DB of n points in k dimensions, the ε-neighbourhood of a point P_i, denoted by N_ε(P_i), is defined as:

    N_\varepsilon(P_i) = \{ P_j \in DB \mid dist(P_i, P_j) < \varepsilon \}, \quad \varepsilon \in \mathbb{R}    (2.2)

where dist() is a similarity function based on the distance between the values of the points. The previous chapter discusses some of the commonly used distance measures. The cluster is defined by means of core data points as follows.

Definition 2 (Directly density-reachable). Based on another parameter τ, a cluster has two kinds of points: core and border (Figure 2.2). If a point has at least τ neighbours in its ε-neighbourhood, it is called a core point, and all of the points in this neighbourhood are said to be directly density-reachable from it. A point is called a border point if it has fewer than τ neighbours in its ε-neighbourhood.

A core point can never be directly density-reachable from a border point, but a border point can be a part of a cluster if it belongs to the ε-neighbourhood of some core point. In Figure 2.2, for example, A and B are core points and C is a border point. C is directly density-reachable from B, but B is not directly density-reachable from C. Direct density-reachability is not symmetric if both points are not core points.

Definition 3 (Density-reachable). A point P_y is density-reachable from a point P_x if there is a chain of points P_1, ..., P_n such that P_1 = P_x, P_n = P_y and P_{i+1} is directly density-reachable from P_i. In Figure 2.2, the data point C is density-reachable from point A. A border point is reachable from a core point but not vice versa. A border point can never be used to reach other border points which might otherwise belong to the same cluster, for example points C and D in Figure 2.2. In that case, if they share a common core point from which both are density-reachable, then they both can be included in the cluster.

Definition 4 (Density-connected). Two points P_x, P_y are said to be density-connected with each other if there is a point P_z such that both P_x and P_y are density-reachable from P_z. Both density-reachability and density-connectivity are defined with respect to the same ε and τ parameters.

Definition 5 (Cluster). A cluster consists of all density-connected points. If a point is density-reachable from a point in the cluster, then that point is included in the cluster as well.

The DBSCAN algorithm starts with an arbitrary point P_x, and if P_x is a core point then DBSCAN retrieves all density-reachable points and adds them to the cluster. If P_x is a border point, then the next point is processed, and so on. The DBSCAN algorithm is not sensitive to outliers and can find clusters of arbitrary sizes and shapes, with a complexity of O(n^2). However, this algorithm uses all of the dimensions to measure the ε-neighbourhood. As the data gets sparsely distributed in high-dimensional space, this algorithm is unable to report meaningful clusters.
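A minimal sketch of the ε-neighbourhood and core-point tests from Definitions 1 and 2 is given below. It is only an illustration, not the DBSCAN implementation itself; the helper names are invented, it assumes NumPy and Euclidean distance, and τ is written as tau.

```python
import numpy as np

def eps_neighbourhood(data, i, eps):
    """Indices of the points within distance eps of point i (Definition 1),
    using the full-dimensional Euclidean (L_2) distance."""
    dists = np.linalg.norm(data - data[i], axis=1)
    return [j for j in range(len(data)) if j != i and dists[j] < eps]

def is_core_point(data, i, eps, tau):
    """Definition 2: a point is a core point if it has at least tau
    neighbours in its eps-neighbourhood."""
    return len(eps_neighbourhood(data, i, eps)) >= tau

if __name__ == "__main__":
    data = np.array([[1.0, 1.0], [1.1, 1.0], [1.0, 1.2], [5.0, 5.0]])
    print(is_core_point(data, 0, eps=0.5, tau=2))   # True: two close neighbours
    print(is_core_point(data, 3, eps=0.5, tau=2))   # False: isolated point
```

The full-dimensional norm inside eps_neighbourhood is exactly what breaks down in high dimensions: as distances concentrate, almost every point either passes or fails the ε test together, whichever ε is chosen.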
2.3.2 Subspace clustering

All of the clustering algorithms discussed above are time-tested and known to perform very well for low-dimensional data. But these algorithms are not suitable for high-dimensional data due to the curse of dimensionality. Also, they fail to give additional information about the relevant dimensions in which the clusters are most significant. Thus, it becomes imperative to find clusters hidden in lower-dimensional subspaces. Subspace clustering algorithms recursively find nested clusters using a bottom-up approach, starting with 1-dimensional clusters and merging the most similar pairs of clusters successively to form a cluster hierarchy.

A number of subspace clustering algorithms have been proposed in recent years. Agrawal et al. [15] were the first to introduce their famous CLIQUE algorithm for subspace clustering, which is discussed below. We also discuss other subspace clustering algorithms: FIRES [57], SUBCLU [58] and INSCY [59], which are largely based on the DBSCAN algorithm [10].

CLIQUE

The CLIQUE (CLustering In QUest) algorithm is based on grid-based computation to discover clusters embedded in subsets of dimensions. The clusters in a k-dimensional space are seen as hyper-rectangular regions of dense points, iteratively built from lower-dimensional hyper-rectangular clusters.

The agglomerative cluster generation process in the CLIQUE algorithm is based on the Apriori algorithm, which was originally used for frequent item-set mining [14] and is discussed in chapter 1 (section 1.2.1). According to the downward closure property of the Apriori principle, if a set of points is a cluster in a k-dimensional space, then this set will be part of a cluster in the (k − 1)-dimensional space. The anti-monotonicity property of this principle helps to drastically reduce the search space for the iterative bottom-up clustering process.

Initially, each single dimension of the data space is partitioned into ξ equal-sized units using a fixed-size grid. A unit is considered dense if the number of points in it exceeds the density support threshold, τ. Only those units which are dense are retained, and the others are discarded. The clustering process involves the generation of k-dimensional candidate units by self-joining those (k − 1)-dimensional units which share their first k − 2 dimensions in common, assuming that the dimensions attached to each dense unit are in sorted order. At each step, the candidate units which are not dense are discarded, and the rest are processed to generate higher-dimensional candidate units. Thus, 1-dimensional base units in the k single dimensions are combined using a self-join to form 2-dimensional candidate units; out of these 2-dimensional units, non-dense units are discarded and the rest are combined to form 3-dimensional candidate units, and so on. Finally, in each k-dimensional subspace, the clusters are formed by computing the disjoint sets of connected k-dimensional units. At the end of this recursive clustering process, we have a set of clusters in their highest possible subspaces. These clusters can lie in the same, overlapping or disjoint subspaces.

The CLIQUE algorithm is insensitive to outliers and can find arbitrary-shaped clusters of varying sizes. Most importantly, for each cluster, additional information about the relevant subset of dimensions is also given.
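The self-join step described above can be sketched as follows. This is an illustration only, not CLIQUE's actual code: it assumes a dense unit is represented as a tuple of (dimension, interval index) pairs kept in sorted dimension order, and the function name is invented.

```python
def join_units(units):
    """One Apriori-style self-join step: two m-dimensional dense units are
    merged into an (m+1)-dimensional candidate when they agree on their first
    m-1 (dimension, interval) pairs and differ in their last dimension."""
    candidates = set()
    ordered = sorted(units)
    for i, u in enumerate(ordered):
        for v in ordered[i + 1:]:
            if u[:-1] == v[:-1] and u[-1][0] < v[-1][0]:
                candidates.add(u + (v[-1],))
    return candidates

if __name__ == "__main__":
    # 2-dimensional dense units over the dimensions 1, 2 and 3
    dense_2d = {((1, 4), (2, 7)), ((1, 4), (3, 2)), ((2, 7), (3, 2))}
    print(join_units(dense_2d))
    # {((1, 4), (2, 7), (3, 2))}: a 3-dimensional candidate unit, kept only if
    # a scan of the data confirms that it is actually dense
```

Every candidate produced this way must still be verified against the database, which is where the repeated scans and the redundant lower-dimensional units discussed next come from.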
The time complexity of the CLIQUE algorithm is O(c^p + pn), where c is a constant, p is the dimensionality of the highest subspace found and n is the number of input data points. The complexity grows exponentially with the dimensions.

The main inefficiency of the CLIQUE algorithm comes from the generation of a large number of redundant dense units during the process. There is no escape from the computation of these redundant units, as they have to be generated in each of the 1-, 2-, ..., (k − 1)-dimensional subspaces before a maximal cluster in a k-dimensional subspace is found. The maximal subspace clusters were introduced in section 1.2.1. Although these dense units are pruned as the algorithm progresses to higher dimensions, it is the first few lower-dimensional subspaces which generate the larger share of these dense units. For example, a k-dimensional dataset has k(k − 1)/2 2-dimensional subspaces. As each dimension is divided into ξ units, each 2-dimensional subspace will have to self-join ξ^2 units. In total, there will be k(k − 1)/2 × ξ^2 units to be self-joined. The self-join further adds to the time complexity by comparing and checking each and every point in the adjacent units. The computational expense of generating and combining dense units at each stage of the recursive process causes the CLIQUE algorithm to break down for high-dimensional data.

CLIQUE extensions

The MAFIA (Merging of Adaptive Finite IntervAls) [60] algorithm proposed improvements over the CLIQUE algorithm through better cluster quality and efficiency. It introduced adaptive grids which are semi-automatically built based on the data distribution, and it uses the same bottom-up cluster generation process starting from one dimension. Although MAFIA yields up to two orders of magnitude speed-up compared to CLIQUE, the execution time of MAFIA still grows exponentially with the dimensionality of the data.

ENCLUS (ENtropy based CLUStering) [61] is another algorithm similar to the CLIQUE algorithm, but it uses the concept of entropy from information theory to find the relevant subspaces for clustering. The underlying premise is that a uniform distribution of data will have a higher entropy than a skewed data distribution. Therefore, the entropy of subspaces having regions of dense units will be low. Based on an entropy threshold, subspaces are selected for clustering. Entropy also helps to prune the subspaces, similar to the downward closure property of the Apriori principle: if a k-dimensional subspace has low entropy, then its (k − 1)-dimensional projections will also have low entropy. The benefit of using entropy is that the ENCLUS algorithm can find extremely dense and small clusters which are otherwise ignored by the CLIQUE algorithm. Yet the additional cost of computing the entropy of each and every subspace makes this algorithm infeasible for high-dimensional data.

SUBCLU

The SUBCLU [58] algorithm relies on DBSCAN to detect clusters in each of the subspaces. Similar to the previous bottom-up clustering approaches, it uses the Apriori principle to prune through the subspaces, and it also generates all lower-dimensional trivial clusters.

FIRES

Kriegel et al. proposed FIRES (FIlter REfinement Subspace clustering) [57], which is a hybrid algorithm to find approximate subspace clusters directly from 1-dimensional clusters. Although it uses a bottom-up search strategy to find maximal cluster approximations, it does not follow the step-by-step Apriori style.
The FIRES algorithm consists of three phases: pre-clustering, generation of subspace cluster approximations, and post-processing of subspace clusters. During the pre-clustering phase, FIRES computes 1-dimensional clusters called base clusters; any clustering technique like DBSCAN, K-means or others can be used to generate these base clusters. The smaller clusters are discarded in this phase. In the second phase, the 'promising' candidates from the 1-dimensional base clusters are chosen based on the similarity among them. FIRES defines the similarity of clusters by the number of intersecting points, and heuristics are used to select the most similar base clusters. The resulting clusters represent hyper-rectangular approximations of the subspace clusters. In the post-processing step, the structures of these approximations are further refined.

FIRES does not employ an exhaustive search procedure to find all possible subspace clusters and therefore outperforms SUBCLU and CLIQUE in terms of scalability and runtime with respect to data dimensionality. However, this performance boost comes at the cost of clustering accuracy. FIRES does not discover all of the hidden subspace clusters and only gives heuristic approximations of subspace clusters, which may or may not overlap.

INSCY

Assent et al. proposed the INSCY algorithm [59] for subspace clustering, which is an extension of the SUBCLU algorithm. They use a special index structure called a SCY-tree, which can be traversed in depth-first order to generate high-dimensional clusters. Their algorithm compares each data point of the base clusters and enumerates them implicitly in order to merge the base clusters for generating the higher-dimensional clusters. The search for the maximal subspace clusters by the INSCY algorithm is quite exhaustive, as it implicitly generates all intermediate trivial clusters during the bottom-up clustering process. The complexity of the INSCY algorithm is O(2^k |DB|^2), where k is the dimensionality of the maximal subspace cluster and |DB| denotes the size of the dataset.

Also, Muller et al. [62] proposed an approach for subspace clustering which reduces the exponential search space while generating intermediate clusters through selective jumps. But again, their algorithm depends upon counting the points across candidate hyper-rectangles to determine their similarity and preference.

2.4 Desirable properties of subspace clustering

We have identified the following desirable properties which should be satisfied by a subspace clustering algorithm for a k-dimensional dataset of n points:

1. The groupings among data points vary under different subsets of dimensions. Although the clusters within the same subspace are disjoint, the clusters from different subspaces can be partially overlapping and share some of the data points among them. Therefore, a subspace clustering algorithm should extract all possible clusters in which a data point participates. For example, if a cluster C in a subspace {1, 3, 4} contains points {P_3, P_6, P_7, P_8} and another cluster C′ in a subspace {1, 3, 6} contains points {P_1, P_3, P_4, P_6}, both of the clusters C and C′ should be detected. Note that both points P_3 and P_6 participate together in two different clusters in different subspaces.
The subspace clustering algorithm should give only non-redundant information, that is, if all the points in a cluster C are also present in a cluster C 0 and the subspace in which C exists is a subset of the subspace in which the cluster C 0 exists, then the cluster C should not be included in the result, as the cluster C does not give any additional information and is a trivial cluster. 22 Chapter 2. Literature Review A strong conformity to this criterion would be that such redundant lower-dimensional clusters are not generated at all, as their generation and pruning later on leads to the higher computational cost. In other words, the subspace clustering algorithm should output only the maximal subspace clusters. As discussed earlier, a cluster is in a maximal subspace if there is no other cluster which conveys the same grouping information between the points as already given by this cluster. The cluster C 0 is thus, a maximal cluster while cluster C is a non-maximal or trivial cluster. The K-means based partitioning algorithms are meant to only find a predefined number of clusters using full-dimensional distance among the data points. The clusters existing in the subspaces of high-dimensional data cannot be discovered using these techniques. Therefore, both of the desirable criteria for efficient subspace clustering cannot be applied to these algorithms. The projected clustering algorithms like PROCLUS can find clusters in the subspaces but fail to detect all maximal clusters and do not conform to the 2nd criterion of desirable properties described above. Neither do these algorithms satisfy the 1st criterion, as only a user defined number of clusters is detected. The non-partitioning clustering algorithms like DBSCAN which are based on fulldimensional space, does not fall under the category of subspace clustering and thus, both criteria on desirable properties can be skipped from the discussion. The hierarchical clustering based algorithms like CLIQUE and SUBCLUE satisfy the 1st criterion of the desired subspace clustering algorithm and can find all of the arbitrary shaped clusters, but they fail to satisfy the 2nd criterion as they still generate many trivial clusters. INSCY algorithm too cannot strongly conform to the 2nd criterion of the desired subspace clustering algorithm. FIRES algorithm fails to satisfy both of the criteria as it does not output all possible clusters and also generate redundant clusters along the process. It is no doubt that subspace clustering is an expensive process. Due to the numerous applications of subspace clustering as discussed in previous chapter, there is an urgent need for efficient solutions to the subspace clustering problem. Exploring all of the subspaces for possible clusters is a challenge. The need for enumerating points in O(2k ) 23 Chapter 2. Literature Review subspaces using the multi-dimensional index structure introduces the computational cost as well as the inefficiency. All of these subspace clustering algorithms discussed so far suffer from the lack of the efficient indexing structures for the enumeration of the points in the multi-dimensional subspaces and also, require multiple database scans. The generation of trivial clusters adds to the complexity. The optimal solution to subspace clustering problem is to generate only maximal clusters with minimal database scans. 
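To make the second desirable criterion above concrete, the containment test it implies can be expressed in a few lines. The following Python fragment is only an illustration; the function name, the cluster representation (a set of point ids plus a set of dimension ids) and the sample clusters are assumptions made for this example, not code from any of the surveyed algorithms.

```python
def is_trivial(cluster_a, cluster_b):
    """Return True if cluster_a is redundant (trivial) with respect to cluster_b.

    A cluster is represented as a pair (points, subspace), where both elements
    are Python sets. cluster_a gives no additional information if all of its
    points appear in cluster_b and its subspace is a proper subset of
    cluster_b's subspace.
    """
    points_a, subspace_a = cluster_a
    points_b, subspace_b = cluster_b
    return points_a <= points_b and subspace_a < subspace_b


# Hypothetical example: C lives in subspace {1, 3}, C' in its superset {1, 3, 4}.
C = ({"P3", "P6", "P7", "P8"}, {1, 3})
C_prime = ({"P3", "P6", "P7", "P8"}, {1, 3, 4})
print(is_trivial(C, C_prime))   # True  -> C should not be reported
print(is_trivial(C_prime, C))   # False -> C' is the maximal cluster
```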
In the next chapter, we overcome the limitations of existing clustering algorithms by proposing a novel approach to efficiently find all possible maximal subspace clusters in high-dimensional data. Our approach fully conforms to both of the desirable criteria of a true subspace clustering algorithm.

Chapter 3

A novel fast subspace clustering algorithm

3.1 Introduction

High-dimensional data poses its own unique challenges for clustering. The basic fact behind these challenges is that the data group together differently under different subsets of dimensions. For better insight into the underlying data, it is important to know the relevant dimensions associated with each group of similar points, called a cluster. Subspace clustering algorithms are the key to discovering such inter-relationships between clusters and the subsets of dimensions called subspaces. As we do not have any prior information about hidden clusters and the relevant subsets of dimensions, an exhaustive search of all subspaces seems necessary. Subspace clustering through a bottom-up hierarchical process promises to find all possible subspace clusters. We have discussed some of these state-of-the-art algorithms in chapter 2. However, the exponential increase in the number of subspaces with the dimensions makes the subspace clustering process extremely expensive. As we discussed in chapter 2, there are some pruning techniques employed by various subspace clustering algorithms to reduce this search space, but redundant clusters are still generated at each stage of the hierarchical process. Although these clusters are eliminated later on, their generation during the clustering process adds to the computational expense. Also, merging of dense units using self-joins and other point-wise matching and comparison techniques brings in further inefficiency. In this chapter, we present our novel solution to the subspace clustering problem for high-dimensional data. Before explaining our approach, we revisit the subspace clustering problem using examples.

3.1.1 Exponential search space

Table 3.1 shows a dummy Marks dataset of 5 students consisting of their examination marks measured over three subjects (dimensions): mathematics, science and arts.

Table 3.1: Marks dataset
Student id     S1     S2     S3     S4     S5
mathematics    10     9.6    4      1.6    1.5
science        8      7.6    7.8    7.7    9
arts           2      8      2.2    2.3    5.2

It might be interesting to find which groups of students perform similarly in which of the exams. Two students might perform similarly in mathematics and science but not in arts. Some other students might score similar marks in all three of mathematics, science and arts. If we assume a similarity distance of 0.5, as shown in Table 3.2, there is one cluster each in the subspaces {mathematics, science} and {science, arts}. Also, no two students have similar marks within the range of 0.5 distance in the subspace {mathematics, arts}. With just three attributes in the above example, there are 2^3 − 1 possible ways to decipher the relevant subspaces of similar points. As the number of dimensions grows from three to hundreds or thousands or higher, there is an exponential growth of possible subspaces which can contain clusters. For efficient clustering, it is important to reduce this search space without any information loss.

Table 3.2: Clusters in the Marks dataset
subspace                            clusters
{mathematics}                       {S1, S2} and {S4, S5}
{science}                           {S1, S2, S3, S4}
{arts}                              {S1, S3, S4}
{mathematics, science}              {S1, S2}
{mathematics, arts}                 nil
{science, arts}                     {S1, S3, S4}
{mathematics, science, arts}        nil

3.1.2 Redundant clusters

We note in Table 3.2 that a student can participate in more than one cluster in different subspaces (for example, student S1). Thus, the clusters from different subspaces can overlap. The overlapping of clusters can also happen within the same subspace, but such clusters can be connected together through common points to get maximal coverage, as proposed in the CLIQUE and DBSCAN algorithms. For example, if students S1 and S2 received similar marks in mathematics, and S1 and S3 also received similar marks in mathematics, then we can say that all three students S1, S2 and S3 scored similar marks in mathematics. Two overlapping clusters from different subspaces represent relationships among points under different circumstances, so it is not always feasible to combine them. For example, if students {S1, S2} score similarly in mathematics and students {S1, S3} score similarly in science, then it cannot be inferred that S3 also scored similarly to S1 in mathematics, or that S2 scored similarly to S1 in science. But if the attached subspaces of two clusters form a hierarchical relationship, then there are situations when one of them can be eliminated. For example, in Table 3.2, there are two disjoint groups of students who score similarly in mathematics: {S1, S2} and {S4, S5}. The group {S1, S2} also scores similarly in the subspace {mathematics, science}. As the group {S1, S2} is redundantly present in both subspaces, one of them can be discarded. The number of redundant clusters grows tremendously with the increase in the number of subspaces. Pruning techniques like the Apriori algorithm help to reduce the number of such overlapping clusters by eliminating the ones present in the lower-dimensional subsets of a subspace.

3.1.3 Pruning and redundancy

According to the anti-monotonicity property of the Apriori principle, if a set of points forms a cluster in a k-dimensional subspace S, then it will be part of a cluster in every (k − 1)-dimensional subspace S′ such that S′ ⊂ S. Thus, a cluster from a higher-dimensional subspace S will be projected as a cluster in all of the 2^k − 1 lower-dimensional subspaces. Considering Figure 3.1, suppose there is a cluster C present in a 3-dimensional subspace {1, 3, 4}; then it will also be present in all the subsets of this subspace: {1, 3}, {1, 4}, {3, 4}, {1}, {3} and {4}. Let there be no superset of the subspace {1, 3, 4} which contains cluster C, which means {1, 3, 4} is a maximal subspace for C. The cluster C in the maximal subspace {1, 3, 4} gives the same grouping information as provided by its projections in the subsets of {1, 3, 4}. It is therefore sufficient to find clusters in only their maximal subspaces. It is not necessary to generate non-maximal clusters because they are trivial, but most of the algorithms implicitly or explicitly compute them. The subspace clustering algorithms based on the bottom-up approach find the maximal subspace clusters using a step-by-step hierarchical cluster building process.
Starting from 1-dimensional subspace, the clusters in (k − 1)-dimensional subspace are combined to generate candidate clusters in the k-dimensional subspace. Then non-dense candidates are discarded in the k-dimensional subspace and rest of the dense clusters are again combined together to find (k + 1)-dimensional candidate clusters and so on. Only after finding the k-dimensional clusters, (k − 1)-dimensional clusters can be eliminated, but not before 28 Chapter 3. A novel fast subspace clustering algorithm Figure 3.1: Typical iterative bottom up generation of clusters based on the Aprioriprinciple. Dense points in 1-dimensional subspace are combined to compute twodimensional clusters which are then combined to compute three-dimensional clusters and so on. that. A large number of these redundant clusters are generated for higher-dimensions and would have already added to the runtime cost before their actual elimination starts. 3.1.4 Multiple database scans and inter-cluster comparisons In addition to mandatory detection of the redundant non-maximal clusters, another inherent problem of step-by-step bottom up clustering algorithms is multiple database scans. An initial database scan is required to generate the 1-dimensional dense clusters. Then, while generating k-dimensional clusters, another database scan is required at each stage to check the occupancy of each candidate and eliminate the non-dense candidates. Along with these database scans, another inefficiency comes from the need to compare each k − 1-dimensional cluster with all of the other k − 1 dimensional clusters during merging phase. The comparison between two set of data points checks for each and every point in both clusters and merge them accordingly. The number of clusters in the merged pool is much larger in lower-dimensional subspaces than in higher-dimensions. 29 Chapter 3. A novel fast subspace clustering algorithm A k-dimensional subspace has 2k lower-dimensional subspaces and the clusters at each of these subspaces need to be compared with each other to generate next higherdimensional candidate. The repeated occupancy check for density and large number of inter-cluster comparisons increases both time and space complexity. The inefficiency increases drastically with the increase in dimensions. In this chapter, we present a novel algorithm called SUBSCALE which tackles all of the challenges faced by the current subspace clustering algorithms much more efficiently. Our algorithm eliminates the need to generate and process redundant subspace clusters, does not require multiple database scans, and above all provides a new technique to compare dense sets of points across dimensions. The SUBSCALE algorithm is far more scalable with the dimensions as compared to the existing algorithms and is explained in details in the next section. 3.2 Research design and methodology Continuing with the monotonicity of the Apriori principle, a set of dense points in a kdimensional space S is dense in all lower-dimensional projections of this space [15]. In other words, if we have the dense sets of points in each of the 1-dimensional projections of the attribute-set of a given data, then the sufficiently common points among these 1dimensional sets will lead us to the dense points in higher-dimensional subspaces. Based on this premise, we develop our algorithm to efficiently find the maximal clusters in all possible subspaces of a high-dimensional dataset. 
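This premise can be illustrated with a few lines of Python. The sketch below uses hypothetical point ids, dimensions and a τ value chosen for the example; it is not the SUBSCALE procedure itself (it enumerates candidate subspaces naively), but it shows how the points common to the 1-dimensional dense sets reveal dense sets in higher-dimensional subspaces.

```python
from itertools import combinations

# Hypothetical 1-D dense points found independently in each dimension.
dense_1d = {
    1: {"P3", "P6", "P7", "P8", "P9"},
    3: {"P3", "P6", "P7", "P8"},
    4: {"P1", "P3", "P6", "P7", "P8"},
    6: {"P1", "P2", "P4"},
}
tau = 3  # a dense unit needs at least tau + 1 points

# Intersect the 1-D dense sets over every candidate subspace; keep only
# subspaces where more than tau common points survive.
for size in range(2, len(dense_1d) + 1):
    for subspace in combinations(sorted(dense_1d), size):
        common = set.intersection(*(dense_1d[d] for d in subspace))
        if len(common) > tau:
            print(subspace, sorted(common))
# (1, 3), (1, 4), (3, 4) and (1, 3, 4) all yield {P3, P6, P7, P8},
# suggesting {1, 3, 4} as the maximal subspace for this group of points.
```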
Before we explain our novel idea to find subspace clusters, we would like to formally define the problem space. 3.2.1 Definitions and problem Let DB = {P1 , P2 , . . . , Pn } be a database of n points in a k-dimensional space. The k dimensions are represented by an attribute-set A : {d1 , d2 , . . . , dk }. Each point Pi in the database DB is a k-dimensional vector, {Pi1 , Pi2 , . . . , Pik } such that, Pid , 1 ≤ d ≤ k 30 Chapter 3. A novel fast subspace clustering algorithm denotes the value measured for a point Pi in the dth dimension. Pid is also called the projection of point Pi in the dth dimension. The database DB can also be seen as an n × k matrix. Subspace A subspace S is a subset of the dimensions from the attribute-set A : {d1 , d2 , . . . , dk }. For example, S : {dr , ds } is a 2-dimensional subspace consisting of dimension dr and ds and the projection of a point Pi in this subspace is {Pidr , Pids }. For the sake of simplicity, we will use only subscript to denote a dimension, therefore, the subspace S in this case becomes {r, s}. Also, we use the term ‘c-D’ to represent any c-dimensional point or group of points, for example, 2-D means a two dimensional point or group of points. The dimensionality of a subspace refers to the total number of dimensions in it. A single dimension can be referred to as a 1-dimensional or 1-D subspace. A subspace with a dimensionality a is a higher-dimensional subspace compared to another subspace with a dimensionality b, if a > b. Also, a subspace S 0 with dimensionality b is a projection of another subspace S of dimensionality a, if a > b and S 0 ⊂ S, that is, all the dimensions participating in S 0 are also contained in the subspace S. Density concepts We adopt the definition of density from DBSCAN [10] which is based on two user defined parameters and τ , such that, a point is dense if it has at least τ points within its neighbourhood of distance. The connectivity among the dense points is used to identify arbitrary shaped clusters. We refer to section 2.3.1 in chapter 2 for the formal definitions of density based connectivity between points. These dense points can be easily connected to form a subspace cluster. 31 Chapter 3. A novel fast subspace clustering algorithm Definition 6 (Maximal subspace cluster). A subspace cluster, C = (P, S) is a group of dense points P in a subspace S, such that ∀Pi , Pj ∈ P , Pi and Pj are density connected with each other in the subspace S with respect to and τ and there is no other point Pr ∈ P , such that Pr is density-reachable from some Pq ∈ / P in the subspace S. A cluster Ci = (P, S) is called a maximal subspace cluster if there is no other cluster Cj = (P, S 0 ) such that S 0 ⊃ S. The maximality of a particular subspace is always relative to a cluster. A subspace which is maximal for a certain group of points might not be maximal for another group of points. For example, a cluster C1 : {P1 , P2 , P3 , P4 } might exist in a subspace S : {d1 , d2 , d4 , d6 } such that there is no superset of S which contains C1 . Thus, S is a maximal subspace for C1 . While another cluster C2 : {P1 , P5 , P6 , P7 } might exists in a maximal subspace S 0 : {d1 , d2 }. Here, subspace S 0 is maximal for cluster C2 . Although subspace S 0 is a subset of subspace S, both subspaces are relevant and maximal for different clusters. Some of the related literature treat the maximality of the clusters in terms of the inclusion of all possible density-connected points (in a given subspace) into one cluster. 
We call it an inclusive property of clustering algorithm. Our ‘maximal subspace clusters’ are both inclusive (with respect to the points in a given subspace) and maximal (with respect to lower-dimensional projections). In the next subsection, we explain the main ideas underlying the SUBSCALE algorithm. 3.2.2 Basic idea Consider an example given in Figure 3.2, the two-dimensional Cluster5 is an intersection of its 1-D projections of points in Cluster1 and Cluster2 . Also, we note that the projections of the points {P7 , P8 , P9 , P10 } on d2 -axis form a 1-D cluster (Cluster3 ), but there is no 1-D cluster in the dimension d1 which has the equivalent points in it, which justify the absence of a two-dimensional cluster containing these points in the subspace {d1 , d2 }. 32 Chapter 3. A novel fast subspace clustering algorithm P1 P2 P3 P4 Dimension d2 P5 P7 P9 P11 P12 P13 P14 Dimension d1 Figure 3.2: Basic idea behind the SUBSCALE Algorithm. The projections of the points {P7 , P8 , P9 , P10 } on d2 -axis form a 1-D cluster, that is, Cluster3 , but no 1-D cluster in dimension d1 have the same points as Cluster3 , therefore, absence of corresponding 2dimensional cluster containing these points in the subspace {d1 , d2 }. Given an m-dimensional cluster C = (P, S) where, S = {d1 , d2 , . . . dm }, the projections of the points in P are dense points in each of the single dimensions, {d1 }, {d2 }, . . . , {dm }. It implies that if a point is not dense in a 1-dimensional space then it will not participate in the cluster formation in higher subspaces containing that dimension. Thus, we can combine the 1-D dense points in m different dimensions to find the density-connected sets of points in the maximal subspace S. Recall that a point is dense if it has at least τ neighbours in neighbourhood with respect to a distance function dist(). In 1-dimensional subspaces, the L1 metric can be safely used as a distance function to find the dense points. Observation 1. If at least τ +1 density-connected points from a dimension di also exist as density-connected points in the single dimensions dj , . . . , dr , then these points will form a set of dense points in the maximal subspace, S = {di , dj , . . . dr }. To illustrate further, let there be four clusters: red, green, blue, purple in higherdimensional subspaces. These four clusters can be in the same subspace or may be in different subspaces of different dimensionality. Assume these subspaces are maximal for 33 Chapter 3. A novel fast subspace clustering algorithm di dj dk Figure 3.3: Projections of clusters in a high-dimensional subspace are dense across participating dimensions. these clusters and contain at least three dimensions di , dj and dk . As discussed before, all of these three dimensions will have the projections of these four clusters as shown in Figure 3.3. An important observation in Figure 3.3 is that the dense projections in the single dimensions can exist as intermixed with other neighbouring dense points. In dimensions dk for example, the points of green cluster are mixed with points from another pink cluster and the pink cluster does not exist in other two dimensions. The challenge is how to connect these dense points from different 1-dimensional spaces to form a maximal subspace cluster. The naive way is to first find the density-connected points in each dimension and then find intersections of all of the density-connected points in all of the single dimensions. 
Each density-connected set can have a different number of points in it, and there can be a different number of density-connected sets in each dimension. Comparing each and every point across dimensions is not an efficient approach for high-dimensional data. Another approach is to divide these density-connected points into smaller units and, instead of comparing each and every point of the density-connected sets, simply compare these units across dimensions to check whether they contain identical points, as shown in Figure 3.4.

Figure 3.4: Matching dense units across dimensions.

Following Definition 3, each point in a subspace cluster belongs to the ε-neighbourhood of at least one dense point. Therefore, the smallest possible projection of a cluster from a higher-dimensional subspace is of cardinality τ + 1; let us call it a dense unit, U. If U_1^{d_i} and U_2^{d_j} are two 1-D dense units from the single dimensions d_i and d_j respectively, we say that they are the same dense unit if they contain the same points, that is, U_1^{d_i} = U_2^{d_j} if ∀P_i [P_i ∈ U_1^{d_i} ↔ P_i ∈ U_2^{d_j}].

Observation 2. Following Observation 1, if the same dense unit U exists across m single dimensions, then U exists in the maximal subspace spanned by these m dimensions.

In order to check whether two dense units are the same, we propose a novel idea of assigning signatures to each of these 1-D dense units. The rationale behind this is to avoid comparing the individual points among all dense units in order to decide whether they contain exactly the same points or not. We can hash the signatures of these 1-D dense units from all k dimensions, and the resulting collisions will lead us to the maximal subspace dense units (Observation 2). Our proposal for assigning signatures to the dense units is inspired by the work in number theory by Erdös and Lehner [63], which we explain in detail below.

3.2.3 Assigning signatures to dense units

If L ≥ 1 is a positive integer, then a set {a_1, a_2, ..., a_δ} is called a partition of L if L = \sum_{i=1}^{\delta} a_i for some δ ≥ 1, where each a_i > 0 is called a summand. Also, let p_δ(L) be the total number of such partitions when each partition has at most δ summands. Erdös and Lehner [63] studied these integer partitions by probabilistic methods and gave an asymptotic formula for δ = o(L^{1/3}):

p_\delta(L) \sim \binom{L-1}{\delta-1} \frac{1}{\delta!}     (3.1)

Observation 3. Assume K is a set of random large integers with δ ≪ |K| ≪ p_δ(L). Let U_1 and U_2 be two sets of integers drawn from K such that |U_1| = |U_2| = δ and δ = o(L^{1/3}). Let us denote the sums of the integers in these two sets as sum(U_1) and sum(U_2) respectively. We observe that if sum(U_1) = sum(U_2) = L, then U_1 and U_2 are the same with extremely high probability, provided L is very large.

Proof. From Equation 3.1, for a very large positive integer L and a relatively very small partition size δ, the number of unique fixed-size partitions is astronomically large. The probability of getting a particular partition set of size δ is

\binom{L-1}{\delta-1} \frac{1}{\delta!} \Big/ \binom{L}{\delta} = \frac{(L-1)!\,\delta!\,(L-\delta)!}{(\delta-1)!\,(L-\delta)!\,\delta!\,L!} = \frac{1}{L\,(\delta-1)!}     (3.2)

This means the probability of randomly choosing the same partition again is extremely low, and it can be made arbitrarily small by choosing a large value of L and a relatively very small δ. For instance, with L ≈ 10^{12} and δ = 4, Equation 3.2 gives a probability of about 1.7 × 10^{-13}. Since L is the sum of the labels of the δ points in a dense unit U, L can be made very large if we choose very large integers as the individual labels.
Thus, with δ = τ + 1, the two dense units U1 and U2 will contain the same points with very high probability, if sum(U1 ) = sum(U2 ), provided this sum is very large. We randomly generate a set K of n large integers and use a one-to-one mapping M : DB 7→ K to assign a unique label to each point in the database. The signature Sig of a dense unit U is given by the sum of the labels of the points in it. Thus, relying on observation 3, we can match these 1-D signatures across different dimensions without dm checking for the individual points contained in these dense units, e.g., if U1d1 , U2d2 , . . . Um are m dense units in m different single dimensions, with their points already mapped to the large integers, we can hash their signature-sums to a hash table. If all the sums collide then these dense units are same (with very high probability) and exist in the subspace {d1 , d2 , . . . , dm }. Thus, the final collisions after hashing all dense units in all dimensions generate dense units in the relevant maximal subspaces. We can combine these dense units to get final clusters in their respective subspaces. 36 Chapter 3. A novel fast subspace clustering algorithm 100,000,000 102 trials 103 trials 104 trials 105 trials 106 trials 107 trials No. of Collisions 1,000,000 10,000 100 1 0 1 2 3 4 5 6 7 8 9 10 11 12 No. of digits used for labels Figure 3.5: A trial includes drawing a label set of 4 random integers with the same number of digits e.g., {333, 444, 555, 666} is a sample set where no. of digits is 3. The probability of drawing same set of integers reduces drastically with the use of larger integers as labels in the set, e.g, ≈ 10 collisions for 100 million trials when the label is a 12-digit integer. Experiments with large integers We did a few numerical experiments to validate observation 3 and the results are shown in Figure 3.5. A trial consists of randomly drawing a set of 4 labels from a given range of integers, e.g., while using 6-digit integers as labels, we have a range between 100000 and 999999 to choose from. All labels in a given set in a given trial uses the same integer range. Each time we fill a set with random integers, we store its sum in a common hash table (separate for each integer range). A collision occurs when the sum of a randomly drawn set is found to be the same as an already existing sum in the hash table. We note that the number of collisions is indeed very small when large integers are used as labels, e.g, there are about 10 collisions (a negligible number) for 100 million trials when the label is a 12-digit integer. Also, we observed the probability of collisions by experimenting with different cardinalities of such partition sets drawn at random from a given integer range. We note in 37 Chapter 3. A novel fast subspace clustering algorithm 1,000,000 |Set| = 3 |Set| = 10 |Set| = 20 |Set| = 30 |Set| = 50 100,000 No of collisions 10,000 1,000 100 10 1 0 2 4 6 No. of digits used for labels 8 10 Figure 3.6: Numerical experiments for probability of collisions with respect to the number of random integers drawn at a time. |Set| denotes cardinality of the set drawn. The number of trials for each fixed integer-set is 1000000. Figure 3.6 that the probability of collisions decreases further with higher values of set sizes. The gap between the number of collisions widens for the larger integer ranges. 
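The collision experiment described above is straightforward to reproduce. The Python sketch below mirrors the trial setup on a smaller scale; the trial counts, label ranges and function name are illustrative assumptions, not the exact configuration used for Figures 3.5 and 3.6.

```python
import random

def count_collisions(num_trials, set_size, num_digits, seed=0):
    """Draw `num_trials` random label sets and count colliding signature sums."""
    rng = random.Random(seed)
    low, high = 10 ** (num_digits - 1), 10 ** num_digits - 1
    seen_sums = set()
    collisions = 0
    for _ in range(num_trials):
        labels = rng.sample(range(low, high + 1), set_size)  # distinct random labels
        signature = sum(labels)
        if signature in seen_sums:
            collisions += 1
        else:
            seen_sums.add(signature)
    return collisions

# Collisions drop sharply as the labels get larger, in line with Figure 3.5.
for digits in (3, 6, 9, 12):
    print(digits, count_collisions(num_trials=100_000, set_size=4, num_digits=digits))
```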
3.2.4 Interleaved dense units Although matching 1-D dense units across 1-dimensional subspaces is a promising approach to directly compute the maximal subspace clusters, it is however difficult to identify these 1-D dense units. The reason is that 1-dimensional subspaces may contain interleaved projections of more than one cluster from higher-dimensional spaces, e.g., in Figure 3.2, Cluster1 contains projections from both Cluster5 and Cluster6 . The only 1| way to find all possible dense units from Cluster1 is to generate all possible |Cluster τ +1 combinations. Let CS be a core-set of such interleaved 1-D dense points such that, each point in this core set is within distance of each other, |CS| > τ . A core-set of the points in a subspace S can be denoted as CS S . In our algorithm, we first find these core-sets in each dimension and then generate all combinations of size τ + 1 as potential dense units. 38 Chapter 3. A novel fast subspace clustering algorithm As can be seen from Figure 3.2, many such combinations in the d1 dimension will not result in dense units in the subspace {d1 , d2 }. Moreover, it is possible that none of the combinations will convert to any higher-dimensional dense unit. The construction of 1-dimensional dense units is the most expensive part of our algorithm, as the number of combinations of τ + 1 points can be very high depending on as the value of will determine the size of the core-sets in each dimension. However, one clear advantage of our approach is that this is the only time we need to scan the dataset in the entire algorithm. 3.2.5 Generation of combinatorial subsets Assuming a core-set CS of c data points, all dense units of size r can be generated from c CS using combinations. There are many algorithms available in the literature to find r the combinatorial sequences [64, 65]. We generate the combinatorial subsets of core-sets in a lexicographic order such that each combination is a sequence l1 , l2 , . . . , lr such that 1 < l1 < l2 < · · · < lr < c. For example, following are the 10 combinations of size 3 generated from a set {1, 2, 3, 4, 5}: i: < 1, 2, 3 > ii: < 1, 2, 4 > iii: < 1, 2, 5 > iv: < 1, 3, 4 > v: < 1, 3, 5 > vi: < 1, 4, 5 > vii: < 2, 3, 4 > viii: < 2, 3, 5 > ix: < 2, 4, 5 > 39 Chapter 3. A novel fast subspace clustering algorithm x: < 3, 4, 5 > Using the initial combination sequence < 1, 2, 3 > as a seed, the next lexicographic sequence can be generated iteratively using the predecessor as shown in the Algorithm 1. We notice in the above combinations that each position in the last sequence < 3, 4, 5 > has reached its saturation point. A position i in a combinatorial sequence is said to have reached its saturation point if it cannot take any larger value, that is when it has reached the maximum possible value of c − r + i. Starting with the position r of the predecessor, we backtrack towards the first position until a position is still active that is, it has not reached its saturation point. The next sequences are generated from an active position to the rth position as shown in step 15 to 17 in Algorithm 1 (getDenseUnits). The algorithm stops when all of the r positions have reached their saturation point. The initial seed is set to < 0, c − r + 2, c − r + 3, . . . , c. The SUBSCALE algorithm is explained in the next subsection. 3.2.6 SUBSCALE algorithm As discussed before, we aim to extract the subspace clusters by finding the dense units in the relevant subspaces of the given dataset by using L1 metric as the distance measure. 
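Before walking through the SUBSCALE steps, the combination generation described in section 3.2.5 can be sketched compactly in Python. This sketch leans on the standard library rather than the explicit seed-and-saturation bookkeeping of Algorithm 1, but it emits the same lexicographic sequence of candidate dense units; the function name is ours.

```python
from itertools import combinations

def get_dense_units(core_set, r):
    """Return every candidate dense unit of size r from a core-set,
    in lexicographic order of the points' positions in the core-set."""
    # itertools.combinations already yields index-ordered (lexicographic) tuples,
    # so no manual tracking of saturated positions is needed here.
    return list(combinations(core_set, r))

# Reproduces the ten combinations i-x of size 3 from {1, 2, 3, 4, 5} listed above.
for unit in get_dense_units([1, 2, 3, 4, 5], 3):
    print(unit)
```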
We assume a fixed size of τ + 1 for the dense unit U which is the smallest possible cluster of dense points. If |CS| is the number of points in a 1-D core-set CS then, we can obtain |CS| dense units from one such CS. If we map the data points contained in each dense τ +1 unit to large integers and compute their sum then each such sum will create a unique signature. From observation 3, if two such signatures match then their corresponding dense units contain the same points with a very high probability. In the SUBSCALE algorithm, we first find these 1-D dense units in each dimension and then hash their signatures to a common hash table (Figure 3.7). We now explain our algorithm for generating maximal subspace clusters in highdimensional data. 40 1 2 3 4 5 Chapter 3. A novel fast subspace clustering algorithm Input: CS: A core-set of c points; r: size of each combination to be generated from the core-set. c Output: DenseU nits: A set of dense units. r seed and U are empty arrays of size r each. for i ← 2 to r do seed[i] ← c − r + i end seed[1] ← 0 /* This step will make sure that first lexicographic sequence will be generated as a seed. */ 6 7 8 9 while true do i←r while i > 0 and seed[i] = c − r + i do Decrement i /* Get the active position. 10 11 12 end if i = 0 then break /* It signifies all combinations have been generated. 13 14 */ */ else temp ← seed[i] /* Get seed element. */ for j ← i to r do 16 k ← temp + 1 + j − i 17 seed[j] ← k 18 U[j] ← CS[k] 19 end 20 Copy dense unit U to the output set of DenseU nits 21 end 22 end Algorithm 1: getDenseUnits: Find all combination subsets of size r from a core-set of size c. 15 41 Chapter 3. A novel fast subspace clustering algorithm Figure 3.7: Signatures from different dimensions collide to identify the relevant subspaces for corresponding dense units behind these signatures. di is the ith dimension and Sigxi is the signature of a dense unit, Uxi in ith dimension. Step 1: Consider a set, K of very large, unique and random positive integers {K1 , K2 , . . . , Kn }. We define M as a one-to-one mapping function, M : DB → K. Each point Pi ∈ DB is assigned a unique random integer Ki from the set K. Step 2: In each dimension j, we have projections of n-points, P1j , P2j , . . . , Pnj . We create all possible dense units containing τ + 1 points that are within an distance. Step 3: Next, we create a hash table hT able, as follows. In each dimension j, for every dense unit Uaj , we generate its signature Sigaj . A signature is calculated by mapping the elements of dense unit Uaj to their corresponding keys from the key database K and summing them up. The signature thus generated is hashed into the hT able. Using observation 3, if Sigaj collides with another signature Sigbk then the dense unit Uaj exists in subspace {j, k} with extremely high probability. After repeating this process in all single dimensions, each entry of this hash table will contain a dense unit in the maximal subspace. The colliding dimensions are stored along with each signature Sigi ∈ hT able. Step 4: We now have dense units in all possible maximal subspaces. We can use any full dimensional clustering algorithm on each subspace to process these maximal dense units into maximal subspace clusters. In our research, we use DBSCAN in each found subspace for the clustering process. The and τ parameters can be 42 Chapter 3. A novel fast subspace clustering algorithm P1, P7, P3, P12, P5, P4, P9, P2, P6, ... 
Figure 3.8: An example of sorted data points in a single dimension adapted differently as per the dimensionality of the subspace to handle the curse of dimensionality. The pseudo code is given in Algorithm 2 (SUBSCALE), Algorithm 3 (FindSignatures) and Algorithm 4 (findSum) below. The Algorithm 2 (SUBSCALE) takes a database of n × k points as input and find maximal subspace clusters by hashing the signatures generated through Algorithm 3 (FindSignatures). The values of τ and are user defined. The key database K is randomly generated but can also be supplied as input. The hash table hT able can be indexed on sum value for direct access and thus, faster collisions. The core-sets are found in each dimension by sorting the projections of data points in that dimension. Starting with each point, the neighbours are collected until they start falling out of range. The Algorithm 1 is used to generate all combinatorial dense units from the core-sets. Algorithm 4 (findSum) finds the signature sum of each dense unit U which is collided with other signature sums to detect the maximal subspace of a dense unit. 3.2.7 Removing redundant computation of dense units The SUBSCALE algorithm can be optimised further for faster execution by removing the redundant calculations of dense units. The dense units are calculated using combinatorial mixing of all points from a core-set. Let us assume a particular dimension d contains sorted data points as: {P1 , P7 , P3 , P12 , P5 , P4 , P9 , P2 , P6 . . . } in that order and let τ = 3. As shown in Figure 3.8, using 43 1 Chapter 3. A novel fast subspace clustering algorithm Input: DB of n × k points. Output: Clusters: Set of maximal subspace clusters. Hash table hT able ← {} /* An entry in hT able is {sum, U, subspace}. 2 3 for j ← 1 to k do Signatures ← F indSignatures(DB, j) /* Get candidate signatures in dimension j. 4 5 6 7 8 9 10 11 12 13 14 */ */ for each entry Sigx : {sum, U, subspace} ∈ Signatures do if there exists another signature Sigy in hT able such that Sigy .sum = Sigx .sum then Append dimension j to Sigy .subspace else Add new entry Sigx to the hT able end end end for all entries {Sigx , Sigy , . . . } ∈ hT able do if Sigx .subspace = Sigy .subspace = . . . then Add entry {subspace, Sigx .U ∪ Sigy .U ∪ . . . } to Clusters /* ∪ is a union set-operator. Clusters contain maximal dense units in the relevant subspaces. */ 15 16 17 end end Run DBSCAN on each entry of Clusters /* Clusters is resulting set of maximal subspace clusters Algorithm 2: SUBSCALE: Find maximal subspace clusters. */ 44 1 2 3 Chapter 3. A novel fast subspace clustering algorithm Input: DB of n × k points; Dimension j; τ ; and . Output: Signatures: Set of signatures of the dense units. Sort P1 , P2 , . . . , Pn , s.t. ∀Px , Py ∈ DB, Pxj ≤ Pyj for i ← 1 to n − 1 do CS ← Pi /* CS is a core-set of dense points. 4 5 6 7 8 9 10 11 12 13 14 15 16 17 */ numN eighbours ← 1 next ← i + 1 j while next ≤ n and Pnext − Pij < do Append Pnext to CS Increment numN eighbours Increment next end if numN eighbours > τ then DenseU nits ← getDenseU nits(|CS|, τ + 1) end for each dense unit U ∈ DenseU nits do sum ← f indSum(U, K) subspace ← j Add entry {sum, U, subspace} to Signatures /* Signatures is a data structure to store the dense units along with their corresponding signatures in a given dimension. */ end end Algorithm 3: FindSignatures: Find signatures of dense units including their sums. 
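Steps 1 to 3 above (Algorithms 2 and 3, with the key summation of Algorithm 4) can be condensed into the following Python sketch. It is a simplified illustration under several assumptions: a tiny in-memory dataset, 64-bit random integers as keys, no pivot optimisation, and no DBSCAN post-processing of the found subspaces. The function name and the toy points are ours, not the thesis' Java implementation.

```python
import random
from collections import defaultdict
from itertools import combinations

def subscale_sketch(data, eps, tau):
    """data: n x k list of lists. Returns {signature: (dense_unit, subspace)}."""
    n, k = len(data), len(data[0])
    keys = {i: random.getrandbits(64) for i in range(n)}  # Step 1: large random labels
    table = defaultdict(lambda: [None, set()])            # signature -> [unit, dims]

    for dim in range(k):                                   # Steps 2-3, per dimension
        order = sorted(range(n), key=lambda i: data[i][dim])
        for start, pid in enumerate(order):
            # Core-set: the start point plus its neighbours within eps in this dimension.
            core = [q for q in order[start:] if data[q][dim] - data[pid][dim] < eps]
            if len(core) <= tau:
                continue
            for unit in combinations(core, tau + 1):       # candidate dense units
                signature = sum(keys[p] for p in unit)     # signature = sum of keys
                table[signature][0] = frozenset(unit)
                table[signature][1].add(dim)

    # Keep units whose signature collided across at least two dimensions; the set of
    # colliding dimensions is the maximal subspace of that dense unit (Observation 2).
    return {sig: (unit, dims) for sig, (unit, dims) in table.items() if len(dims) > 1}

# Example: 6 points in 3 dimensions; DBSCAN would normally be run per found subspace.
points = [[0.10, 0.50, 0.90], [0.11, 0.52, 0.10], [0.12, 0.51, 0.11],
          [0.13, 0.49, 0.12], [0.80, 0.20, 0.13], [0.81, 0.22, 0.95]]
for unit, dims in subscale_sketch(points, eps=0.05, tau=2).values():
    print(sorted(unit), sorted(dims))
```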
18 19 1 2 3 Input: A dense unit U, A set K of n unique random and large integers Output: sum: Sum of the keys corresponding to each point in the dense unit. sum ← 0 for each Pi in U do sum ← sum + M (Pi ) /* M as a one-to-one mapping function, M : DB → K */ end Algorithm 4: findSum: Calculates the sum of the corresponding keys of points in the dense unit U. 4 45 Chapter 3. A novel fast subspace clustering algorithm Step 2 of Algorithm 3, beginning with point P1 , the core-set CS1 = {P1 , . . . , P9 } and then with next point P7 in this sorted set, the core-set CS2 comes out as:{P7 , . . . , P9 }, based on a given value. The dense units are generated as the combinations of τ +1 points from the core-sets. We notice that all of the points in CS2 have already been covered by the CS1 , therefore, CS2 will not generate any new signatures than those generated by the core-set CS1 . The reason behind this redundancy is that both CS1 and CS2 share same lastElement which is, the data point P9 . We can eliminate these computations by keeping a record of the lastElement of the previous core-set CSi . If the lastElement of the core-set CSi+1 is same as that of the previous core-set, then we can safely drop the core-set CSi+1 . In 7 6 this case, the core-set CS1 generate dense units out which = 15 dense units 4 4 will be generated again by CS2 if we do not eliminate it. Another cause of redundant dense units is the overlapping of points between the consecutive core-sets. For example, as we can see in the Figure 3.8, the core-set CS3 starting with point P3 will contain points {P3 , . . . , P6 } and we cannot discard this core-set as the lastElement is not same as that of core-set CS2 . The intersecting set of points between 5 core-sets CS2 and CS3 consists of 5 points :{P3 , P12 , P5 , P4 , P9 }. Thus, = 5 com4 binations of CS3 would already have been generated by CS2 . To eliminate the redundant dense unit computations due to overlapping data points, we can use a special marker in each core-set called pivot which is the position of the lastElement of the previous core-set. For example, in core-set CS1 , the pivot can be set to −1 which means that we need to compute all of the combinations from this set as there is no existence of the previous lastElement and none of the combinations from this core-set has been computed before (Figure 3.9). There is one more scenario when we need to re-compute all the combinations of the core-set even when the lastElement from the previous core-set exists in the current core-set and that happens when 0 ≤ pivot ≤ τ . Therefore, when pivot > τ in the current core-set, we need not compute τ + 1 combinations for points lying between the index 1 to pivot in the core-set. For core- 46 Chapter 3. A novel fast subspace clustering algorithm P1, P7, P3, P12, P5, P4, P9, P2, P6, ... pivot = -1 CS1 discarded CS2 pivot = 5 CS3 Figure 3.9: An example of overlapping between consecutive core-sets of dense data points P1, P7, P3, P12, P5, P4, P9, P2, P6 CS1 CS2 pivot = 5 CS3 P 3, P12, P5, P4, P9 P2, P 6 CS32 CS31 Dense Unit Figure 3.10: An example of using pivot to remove redundant computations of dense units from the core-sets set CS3 = {P3 , P12 , P5 , P4 , P9 , P2 , P6 }, the points between index 1 and 5 have already 7 been computed for dense units by CS2 . Instead of computing = 35 combinations, 4 we can create partial combinations from two partitions of the core-set as shown in Figure 3.10. 
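The same partitioning can be stated more directly: only those combinations that use at least one point beyond the pivot are new. The Python sketch below is an equivalent reformulation of this idea under that assumption, not a transcription of Algorithms 5 and 6; the function name and the example core-set follow the running example.

```python
from itertools import combinations

def new_dense_units(core_set, pivot, r):
    """Yield only the size-r combinations of core_set that were NOT already
    produced by the previous core-set.

    `pivot` is the number of leading points already covered by the previous
    core-set, so every new combination must use at least one point from
    core_set[pivot:].
    """
    old, new = core_set[:pivot], core_set[pivot:]
    for taken_from_new in range(max(1, r - len(old)), min(r, len(new)) + 1):
        for tail in combinations(new, taken_from_new):
            for head in combinations(old, r - taken_from_new):
                yield head + tail

# CS3 = {P3, P12, P5, P4, P9, P2, P6}; its first five points were already in CS2,
# so only combinations touching P2 or P6 are generated.
CS3 = ["P3", "P12", "P5", "P4", "P9", "P2", "P6"]
units = list(new_dense_units(CS3, pivot=5, r=4))
print(len(units))   # 30: the 35 combinations of size 4 minus the C(5, 4) = 5 already generated
```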
As in this case of core-set CS3 , using pivot based computation of dense units will result in only those dense unit combinations which have not been generated before in the previous core-sets. When the size of core-sets get bigger, this approach results in considerable savings in computational time and efficiency. The improved SUBSCALE algorithm is given below in Algorithm 5 (findOptimalSignatures) and Algorithm 6 (getDenseUnitsPivot). In these algorithms, we use the notation |Set| to denote the number of elements in the Set. 47 1 2 3 4 5 Chapter 3. A novel fast subspace clustering algorithm Input: DB of n × k points; Dimension j; τ ; and . Output: Signatures: Set of signatures of the dense units. Sort P1 , P2 , . . . , Pn , s.t. ∀Px , Py ∈ DB, Pxj ≤ Pyj last ← −1 pivot ← −1 for i ← 1 to n − 1 do CS ← Pi /* CS is a core-set of dense points 6 7 8 9 10 11 12 13 14 15 16 17 18 19 numN eighbours ← 1 next ← i + 1 j while next ≤ n and Pnext − Pij < do Append Pnext to CS Increment numN eighbours Increment next end newLast ← CS.lastElement if newLast! = last then pivot ← CS.indexOf (last) last ← newLast if numN eighbours > τ then if pivot ≤ τ then DenseU nits ← getDenseU nits(|CS|, τ + 1) /* |CS| is the total number of points in CS. 20 21 22 23 24 25 26 27 28 */ */ else DenseU nits ← getDenseU nitsP ivot(CS, pivot) end end end for each dense unit U found in the previous step do subspace ← j sum ← f indSum(U) Add entry {Sig, U, subspace} to Signatures /* Signatures is a data structure to store the dense units along with their corresponding sum values in a given dimension. */ end 30 end Algorithm 5: findOptimalSignatures: Find optimal signatures of dense units including their sums. 29 48 1 2 3 4 5 6 7 8 9 10 11 12 Chapter 3. A novel fast subspace clustering algorithm Input: Core-set CS, pivot. Output: DenseU nits: Set of dense units each of size τ + 1. Split CS into CS1 and CS2 such that CS1 contains first 1 . . . pivot points and CS2 contains rest of the points if |CS2 | > τ then DenseU nits ← getDenseU nits(|CS2 |, τ + 1) select ← τ else select ← |CS2 | end count ← 1 do Combine getDenseU nits(|CS1 |, τ + 1 − count) and getDenseU count) nits(|CS2 |, whichare partial dense units, to generate a |CS1 | |CS2 | total of × dense units and add them to the set τ + 1 − count count DeneU nits count ← count + 1 while count ≤ select Algorithm 6: getDenseUnitsPivot: Find dense units using the pivot. In the next section, we evaluate and discuss the performance of our proposed subspace clustering algorithm. 3.3 Results and discussion We experimented with various datasets upto 500 dimensions (Table 3.3). Also, we compared the results from our algorithm with other relevant state-of-the-art algorithms. We fixed the value of τ = 3 unless stated otherwise and experimented with different values of for each dataset, starting with the lowest possible value. The minimum cluster size (minSize) is set to 4. 3.3.1 Methods We implemented the SUBSCALE algorithm in Java language on an Intel Core i7-2600 desktop with 64-bit Windows 7 Operating system and 16GB RAM. The dense points in maximal subspaces were found through the SUBSCALE algorithm and then for each 49 Chapter 3. 
A novel fast subspace clustering algorithm Table 3.3: List of datasets used for evaluation Data D05 D10 D15 D20 D25 D50 S1500 S2500 S3500 S4500 S5500 madelon Size 1595 1595 1598 1595 1595 1596 1595 2658 3722 4785 5848 4400 Dimensionality 5 10 15 20 25 50 20 20 20 20 20 500 found subspace containing dense set of points, we used a Python Script to apply DBSCAN algorithm from the scikit library [66]. However, any full-dimensional density-based algorithm can be used instead of DBSCAN. The open source framework by Müller et al. [67] was used to assess the quality of our results and also to compare these results with other clustering algorithms available through the same platform. The datasets in Table 3.3 were normalised between 0 and 1 using WEKA [68] and contained no missing values. The data sets are freely available at the website of authors of the related work [67]. The 4400 × 500 madelon dataset available at UCI repository [69] The source code of the SUBSCALE algorithm can be downloaded from the Git repository [70]. 3.3.2 Execution time and quality The effects of changing values on runtime is shown in Figure 3.11 for two datasets of dimensionality 5 and 50. The larger values results in bigger neighbourhoods and generate large number of combinations, therefore, more execution time. 50 Chapter 3. A novel fast subspace clustering algorithm 60,000 50,000 D05 D50 Runtime (ms) 40,000 30,000 20,000 10,000 0 0 0.001 0.002 0.003 Epsilon 0.004 0.005 Figure 3.11: Effect of on runtime on two different datasets of 5 and 50 dimensions. The clustering time increases with the increase in -value due to bigger neighbourhood. More number of combinations needs to be generated for a bigger core-set. One of the most commonly used method to assess the quality of the clustering result in the related literature is F1 measure [71]. The open source framework by Muller et al. [67] was used to assess the F1 quality measure of our results. According to F1 measure, the clustering algorithm should cover as many points as possible from the hidden clusters and as fewer as possible of those points which are not in the hidden clusters. F1 is computed as the harmonic mean of recall and precision. The recall value accounts for the coverage of points in the hidden clusters by the found clusters. The precision value measures the coverage of points in found clusters from other clusters. A high recall and precision value means high F 1 and thus, better quality [71]. Figure 3.12 shows the F1 values for different datasets and with different epsilon settings. We notice that the quality of clusters deteriorates beyond a certain threshold of because clusters can get artificially large due to the larger value. Thus, a larger value of is not always necessary to get better quality clusters. 51 Chapter 3. A novel fast subspace clustering algorithm 0.0022 0.002 Epsilon 0.0018 0.0016 S2500 S3500 S4500 S5500 D20 D25 0.0014 0.0012 0.001 0.4 0.5 0.6 F1 0.7 0.8 0.9 1 Figure 3.12: -value versus F1 measure for six different datasets. The change in epsilonvalue seems to have similar impact on the cluster quality (F1 measure) for each dataset. The cluster quality degrades for bigger epsilon-values. We evaluated the SUBSCALE algorithm against the INSCY algorithm which is the recent state-of-the-art subspace clustering algorithm. The INSCY algorithm is available through the same framework used for F1 evaluation [67]. We used same parameter values (, τ and minSize) for both INSCY and SUBSCALE algorithms. 
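Since the comparisons that follow are reported in terms of F1, the underlying computation is worth spelling out. The snippet below is a generic illustration of the harmonic-mean formula for a single found/hidden cluster pair; it is not the evaluation framework of Müller et al. [67], and the function name is ours.

```python
def f1_score(found, hidden):
    """F1 for one found cluster against one hidden cluster (sets of point ids)."""
    overlap = len(found & hidden)
    if overlap == 0:
        return 0.0
    recall = overlap / len(hidden)     # coverage of the hidden cluster's points
    precision = overlap / len(found)   # fraction of found points that really belong
    return 2 * recall * precision / (recall + precision)

print(f1_score({1, 2, 3, 4, 5}, {2, 3, 4, 5, 6}))  # 0.8
```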
As shown in Figures 3.13 and 3.14, the SUBSCALE algorithm gave much better runtime performance for similar quality of clusters (F1=0.9), particularly for higher-dimensional datasets. Figure 3.15 shows the effect of increasing dimensionality of the data on the runtime for the SUBSCALE algorithm versus other subspace clustering algorithms (INSCY, SUBCLU and CLIQUE available on Open source framework [67]), keeping the data size fixed (Data used: D05, D10, D15, D20, D25, D50, D75). The runtime axis of Figure 3.15 is plotted on the logarithmic scale. As we notice that CLIQUE algorithm shows the worst performance in terms of scalability w.r.t the dimensionality of the dataset. The parameters of τ = 0.003 and ξ = 3 were used for CLIQUE while similar values of = 0.001 and τ = 3 were used for SUBSCALE, INSCY and SUBCLU algorithms. However, the 52 Chapter 3. A novel fast subspace clustering algorithm 160,000 140,000 SUBSCALE Inscy Runtime (ms) 120,000 100,000 80,000 60,000 40,000 20,000 0 S1500 S2500 S3500 D20 D25 Clustering results with F1=0.9 D50 Figure 3.13: Runtime comparison between SUBSCALE and INSCY algorithms for similar quality of clusters (with F1 measure 0.9) for six datasets of different sizes and different number of dimensions. The SUBSCALE algorithm gave better performance than the INSCY algorithm. 53 Chapter 3. A novel fast subspace clustering algorithm 20,000 SUBSCALE Inscy Dataset: D25 (1595 x 25) Runtime (ms) 15,000 10,000 5,000 0 0.4 0.5 0.6 F1 0.7 0.8 0.9 1 Figure 3.14: Runtime comparison between SUBSCALE INSCY algorithms for the same dataset but using different quality output of clusters. The cluster quality can be changed by changing the epsilon-value used for finding the subspace clusters. The SUBSCALE algorithm gave better performance than the INSCY algorithm. 54 Chapter 3. A novel fast subspace clustering algorithm 100,000 SUBSCALE INSCY SUBCLU CLIQUE Runtime (ms) 10,000 1,000 100 10 5 10 15 Dimensionality 20 25 Figure 3.15: Runtime comparison between different subspace clustering algorithms for fixed data size=1595 and with dimensions varing from 5 to 25. As mentioned in the discussions, the SUBSCALE algorithm gave best performance. SUBCLU algorithm did not give much meaningful clusters as even single points were shown as clusters in the result and CLIQUE algorithm crashed for ≥ 15 dimensions. We observe that the SUBSCALE algorithm clearly performs better than the rest of algorithms with the increase in number of dimensions. Figure 3.16 shows the runtime comparison of our algorithm with INSCY and SUBCLU for a fixed number of dimensions and varying the size of the data from 1500 points to 5500 points (Dataset used: S1500, S2500, S3500, S4500, S5500). The 4400 × 500 madelon dataset has ∼ 2500 possible subspaces. We ran the SUBSCALE algorithm on this dataset to find all possible subspace clusters and Figure 3.17 shows the runtime performance with respect to values. The different values of ranging from 1.0 × 10−5 to 1.0 × 10−6 were used. We tried to run INSCY by allocating upto 12GB RAM but it failed to run for this high-dimensional dataset for any of the values. 55 Chapter 3. A novel fast subspace clustering algorithm 90,000 80,000 Runtime (ms) 70,000 SUBSCALE INSCY SUBCLU 60,000 50,000 40,000 30,000 20,000 10,000 0 1000 2000 3000 4000 Size of dataset 5000 6000 Figure 3.16: Runtime comparison between different subspace clustering algorithms for fixed dimensionality = 20 with data size varying from 1500 to 5500. 
180,000 Dataset: madelon Runtime (ms) 160,000 138635 Clusters 140,000 120,000 100,000 80,000 345 60,000Clusters 40,000 0 41391 Clusters 4897 Clusters 5,000 10,000 15,000 20,000 No. of subspaces 25,000 30,000 Figure 3.17: Number of subspaces/clusters found vs runtime (madelon dataset). τ = 3, minSize = 4. Different values ranging from 1.0E − 5 to 1.0E − 6 were used. The number of clusters as well as subspaces in which these clusters are found, increases with the increase in value. 56 Chapter 3. A novel fast subspace clustering algorithm 3.3.3 Determining the input parameters An unsupervised clustering algorithm has no prior knowledge of the density distribution for the underlying data. Even though the choice of density measures (, τ and minSize) is very important for the quality of subspace clusters, finding their optimal values is a challenging task. The SUBSCALE algorithm initially requires both τ and parameters to find 1-D dense points. Once we identify the dense points in the maximal subspaces through the SUBSCALE algorithm, we can then run the DBSCAN algorithm for the identified subspaces by setting the τ , and minSize parameters according to each subspace. We should mention that finding clusters from these subspaces by running the DBSCAN algorithm or any other density based clustering algorithm will take relatively very less time. During our experiments, the average time taken by the DBSCAN algorithm comprises less than 5% of the total execution time for the evaluation datasets given in Table 3.3. The reason is that each of the identified maximal subspaces by the SUBSCALE algorithm has already been pruned for only those points which have very high probabilities of forming clusters in these subspaces. In our experiments, we started with the smallest possible -distance between any two 1-D projections of points in the given dataset. Considering the curse of dimensionality, the points become farther apart in high-dimensional subspaces, so the user may intend to find clusters with larger values than that used by the SUBSCALE algorithm for 1-D subspaces. Most of the subspace clustering algorithms use heuristics to adjust these parameters in higher-dimensional subspaces. Some authors [72,73] have suggested methods to adapt and τ parameters for high-dimensional subspaces. However, we would argue that the choice of these density parameters is highly subjective to the individual data set as well as the user requirements. 57 Chapter 3. A novel fast subspace clustering algorithm 3.4 Summary In this chapter, we have presented a novel approach to efficiently find the quality subspace clusters without expensive database scans or generating trivial clusters in between. We have validated our idea both theoretically as well as through numerical experiments. Using the SUBSCALE algorithm, we have experimented with 5 to 500 dimensional datasets and analysed in detail its performance as well as the factors influencing the quality of the clusters. We have also discussed various issues governing the optimal values of the input parameters and the flexibility available in the SUBSCALE algorithm to adapt these parameters accordingly. Since our algorithm directly generates the maximal dense units, it is possible to implement a query-driven version of our algorithm relatively easily. Such an algorithm will take a set of (query) dimensions and find the clusters in the subspace determined by this set of dimensions. 
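As a rough sketch of how such a query-driven variant could look (the helper name, the cluster representation and the sample units are hypothetical, and this is not part of the current implementation), the hash table of maximal dense units could simply be filtered by the queried dimensions:

```python
def query_subspace(maximal_units, query_dims):
    """Return the dense units relevant to a user-supplied set of dimensions.

    maximal_units: iterable of (points, subspace) pairs, where both elements are
    sets. By the monotonicity argument, a unit whose maximal subspace covers the
    queried dimensions also projects onto the query subspace; clustering (e.g.
    DBSCAN) would then be run on the returned points restricted to query_dims.
    """
    query = set(query_dims)
    return [(points, subspace) for points, subspace in maximal_units
            if query <= subspace]

# Example with two hypothetical maximal dense units:
units = [({"P1", "P3", "P4"}, {1, 3, 4}), ({"P2", "P5"}, {2, 6})]
print(query_subspace(units, {1, 3}))   # only the first unit qualifies
```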
However, the main cost in the SUBSCALE algorithm is computation of candidate one-dimensional dense units. All of these dense units need to be stored in a hash table in working memory first so that the collisions could be identified. So, the efficiency of the SUBSCALE algorithm seems to be limited by the availability of working memory. But, this algorithm has a high degree of parallelism as there is no dependency in computing these dense units across different dimensions. The computations can be split and processed as per the availability of working memory. We discuss these scalability issues and possible solutions in the next chapter. Chapter 4 Scalable subspace clustering 4.1 Background Datasets have been growing exponentially in recent years across various domains of healthcare, sensors, Internet and financial transactions [74, 75]. A recent IDC report has predicted that the total data being produced in the world will grow up to 44 trillion gigabytes by 2020 [76]. In fact, accumulated data has been following Moore’s law that is doubling in size every two years. This explosion has been both in size as well as dimensions of the data, for example, sophisticated sensors these days can measure increasingly large number of variables like location, pressure, temperature, vibration, humidity etc. Therefore, a subspace clustering algorithm should be scalable with both size and dimensions of the data. The SUBSCALE algorithm presented in the previous chapter needs to be explored further to accommodate required scalability. As the data grows in size and/or dimensions, the number of hidden clusters are also expected to grow. However, the number of clusters will depend upon the underlying data distribution and the density parameter settings. Sometimes a smaller dataset can have larger number of clusters than a bigger dataset. As shown in Figure 4.1, only one cluster exists in Data Set 1 which is seemingly bigger in 58 59 Chapter 4. Scalable subspace clustering Data Set 1 Data Set 2 Figure 4.1: The number of clusters from a bigger data set (left) can be less than the number of clusters hidden in the smaller data set (right) size than Data Set 2 which has three clusters hidden inside it. The reason is that Data Set 1 is much more uniformly distributed than Data Set 2. On the other hand, increase in dimensions has a different impact than increase in size of the dataset, on the number of clusters. As discussed in chapter 1, due to the curse of dimensionality, data appears to be sparse in higher dimensions. The lack of contrast among data points in higher-dimensional subspaces results in decrease in the number of clusters as the data points appear to be equidistant from each other. Figure 4.2 shows an example of increase in sparsity among data points as we move from 1dimensional to 2 and 3 dimensional data space. The three groups of points represented by red, green, and blue which were closer in the 1-dimensional space becomes farther apart in the 3-dimensional space. This is the reason why most of the algorithms which use all dimensions of the data to measure the proximity among points report fewer clusters for high-dimensional data. One more implication of this curse of dimensionality is that in high-dimensional data, clusters are often hidden in lower-dimensional projections. As the number of dimensions increases, the number of lower-dimensional projections (subspaces) also grow. 
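The growth in the number of s-dimensional subspaces of a d-dimensional dataset is simply the binomial coefficient C(d, s). The short Python check below reproduces the kind of counts tabulated next; the chosen d and s values mirror Table 4.1.

```python
from math import comb

# Number of s-dimensional subspaces of a d-dimensional dataset: C(d, s).
for d in (10, 100, 1000, 10000):
    print(d, [comb(d, s) for s in (2, 3, 4, 5)])
# e.g. C(100, 2) = 4950 and C(1000, 3) = 166167000, matching the counts in Table 4.1.
```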
Table 4.1 shows increase in the number of low-dimensional subspaces as the total number of dimensions increases from 10 to 10000. We have discussed some of the clustering algorithms in 60 Chapter 4. Scalable subspace clustering 3-dimensional space 2-dimensional space 1-dimensional space Figure 4.2: Data sparsity with increase in the number of dimensions. There are fewer clusters in high-dimensional dataset. The clusters lie in lower-dimensional subspaces of the data. detail in chapter 3 and have also highlighted reasons for why these clustering algorithms struggle to find all hidden clusters in high-dimensional data. The top-down projected clustering algorithms cannot handle high-dimensional data as these are more of a space partitioning algorithms and moreover, user has to define the number of clusters and the relevant subspaces. As clustering is an unsupervised data mining process, we do not have prior information about the underlying data density. Without exploring all possible subspaces it is not possible to find all hidden clusters, even though Table 4.1: Number of subspaces with increase in dimensions Size of the subspace 2 3 4 5 10 45 120 210 252 Total number of dimensions 100 1000 10000 4950 499500 49995000 161700 166167000 166616670000 3921225 41417124750 416416712497500 75287520 8250291250200 832500291625002000 61 Chapter 4. Scalable subspace clustering the number of clusters may turn out to be really small at the end. Most of the subspace clustering algorithms explode as the number of dimensions increases due to the exponential search space. The importance of the SUBSCALE algorithm becomes significant when it comes to dealing with high-dimensional data to find subspace clusters. The SUBSCALE algorithm computes the maximal subspace clusters directly through 1-dimensional dense points and without the need for exploring each and every lower-dimensional subspace. Each cluster is detected in its relevant maximal subspace and all possible non-trivial subspace clusters are detected. The main issue with the SUBSCALE algorithm proposed in the previous chapter is that it should be able to store all of the 1-dimensional dense points in the working memory of the system. So, the size of the working memory becomes a constraint for the scalability of this algorithm. Even though the cost of RAM (Random Access Memory) is coming down with each passing year, memory requirements can go extremely large for bigger data sets. Ideally, a subspace clustering algorithm should be able to crunch bigger datasets within the available memory. In this chapter, we aim to modify the SUBSCALE algorithm so that it can handle bigger datasets with limited working memory. In the next section, we discuss in detail the memory bottleneck caused during SUBSCALE computations. In section 4.3, we look at the solutions to avoid large memory requirement and propose a scalable algorithm in section 4.4. The experimental results with the scalable algorithm are analysed in section 4.5. 4.2 Memory bottleneck The main computation of the SUBSCALE algorithm is based on generation of the dense units across single dimensions. Even if these dense units are not combined iteratively in a step-by-step bottom-up manner as in the other subspace clustering algorithms, the 62 Chapter 4. Scalable subspace clustering = Figure 4.3: The information about a signature generated from a dense unit is stored in a Sig data structure. The information contains sum value, points in the dense unit and a set of dimensions in which this signature exists. 
signatures of the dense units still need to be matched with each other. A common hash table is required to match the dense units from different dimensions. Recalling the SUBSCALE algorithm from chapter 3, each of the k-dimensional n data points is mapped to a unique key from a pool of n large-integer keys. The dense units containing τ + 1 points are computed in each dimension and their signatures are calculated. A signature of a dense unit is the sum of the corresponding keys of the points contained in a dense unit. The value of a signature is thus, a large integer. If signatures of two dense units from different dimensions collide, that is, both have the same value, then, both dense units have exactly the same points in them with very high probability. The information about a dense unit along with its signature and the subspace in which it exists, is kept in a signature node, called Sig (Figure 4.3). Initially, the subspace information in each Sig contains the dimension in which the corresponding dense unit is created. When two dense units from two different dimensions collide, their subspace fields are merged. As proposed in the previous chapter, a common hash table is used to collide the signatures from different single dimensions. After all of the single dimensions have finished their collisions, the maximal sets of dense units are collected from the hash table. Each of the k-dimensions can generate different number of signatures. When two signatures collide with each other in the hash table, the complete details of the second signature need not be stored in the hash table as its sum value will be similar to the first one and only the subspace of the second signature is recorded. Therefore, the total capacity of the hash table should be roughly less than the total number of signatures in all dimensions. Considering the hash table in Figure 4.4, let p, q, . . . , r be the total number of signatures generated in each dimension from 1 to k and collisionConstant is the number 63 Chapter 4. Scalable subspace clustering Figure 4.4: Signatures from different dimensions collide in a common hash table. of signatures which collided with the signatures already in hash table and thus, does not need extra space in the hash table. The hash table should have enough capacity to store (p×q×· · ·×r)−collisionConstant signature nodes. As said earlier, we do not have prior information about the clusters and hence, we do not know which dense units or signatures will collide. Therefore, the value of collisionConstant is not known before hand. Even if the collisionConstant is known through some magic, the resulting memory requirement can be enormous for high-dimensional data sets. Depending upon the underlying data distribution, excessive number of dense units can be generated from a given dataset. The hash table needs to have enough capacity to store the signatures generated from these dense units. The underlying premise of the SUBSCALE algorithm is that if a dense unit Ux with a signature Sigxi exists in the ith dimension and it also exists in the pth and q th dimensions, then this unit will have the same signature in all of the three individual dimensions. To check for the maximal subspace {i, p, q} for dense unit Ux , we need to check for the collisions of the signatures Sigxi , Sigxp and Sigxq using a common hash table. The madelon dataset in UCI repository [69] has 4400 data points in 500 dimensions. 
Using parameters = 0.000001 and τ = 3, we calculated core-sets in each dimension and found that a total of 29350693 dense units are expected according to the Algorithm 5 64 Chapter 4. Scalable subspace clustering given in previous chapter. Each dense unit will have 4 data points because τ = 3. If we use random 14-digit integers as the keys, each sum generated from a dense unit can be up to 15-digit long integer and would require unsigned long long int or similar data type in a typical Java or C programming language. Referring to a signature node Sigx corresponding to a dense unit Ux , fixed space is required to store the sum and the τ + 1 dense points. But we cannot determine the space requirement for the subspace of Ux before all of the collisions from all single dimensions have taken place. Let us assume that no two dense units from different dimensions collided with each other and all of the 29350693 signatures need to be stored. In that case, the subspace field in Sigx will contain only a single dimension in which Sigx was generated. We assume that the data point ids and dimensions can be represented by int data type. The total space required for each signature node will be: sizeof (unsigned long long + (τ + 1) ∗ sizeof (int) + sizeof (int). On a typical machine, a signature node takes atleast 144 bytes of memory space. Therefore, the total space requirement for a hash table to store 29350693 entries is approximately 4 GB. As the size of data grows, this memory requirement for the hash table will increase substantially. Therefore, the SUBSCALE algorithm need to be reworked to accommodate the scalability, irrespective of the main memory constraint, which is still a bottleneck in its performance. In the next section we examine the hash table and the computations involved in it so as to improve the SUBSCALE algorithm for bigger data sets. 4.3 Collisions and the hash table Let us revisit the Algorithm 2 from chapter 3. The SUBSCALE algorithm proceeds sequentially with respect to the dimensions (Step 2 of Algorithm 2). A dimension (i + 1) will be processed after the dimension i has finished the collisions of its signatures in the hash table hT able (Step 8). If the hash table capacity has already reached its maximum 65 Chapter 4. Scalable subspace clustering by the time the (i + 1)th dimension is processed then hT able is more likely to either crash or give memory overflow error. If there was a maximal dense unit in a subspace {i, i + 1, i + 2}, it would not be detected. To check for a maximal subspace in which a dense units exists, we need to check for its collision by processing other dense units in all single dimensions. As discussed before, the total number of expected dense units can be pre-calculated from the core-sets in all single dimensions. One approach to tackle the limited memory could be to divide the dimensions into t non-overlapping sets: {dimSet1 , dimSet2 , . . . , dimSett } where each dimSet is a collection of dense units from one or more single dimensions. No dimension participates partially in a dimSet. These dimension sets can be processed independently either individually or together as a combination of more than one dimSet. The choice will depend upon the availability of the working memory to find partial maximal subspace dense units in that set. After finishing the collisions from a dimSet, the partial maximal dense units can be swapped back to the secondary storage. 
After processing all of such dimension sets or combination of dimension sets, these partial maximal dense units can be combined (using signature collisions again) to get maximal subspace dense units. For example, if Ux = {P1 , P2 , P3 , P4 } is a partial maximal dense unit in dimSet : {2, 3, 4} and if it also exists in dimSet : {5, 6, 7} then Ux exists in the union of these sets, that is, dimSet : {2, 3, 4, 5, 6, 7}. There are two problems with the above approach. Firstly, the density distribution of the projections of data points in each dimension is different. For bigger datasets, the number of combinatorial dense units from a single dimension can surpass the available memory. To find even partially maximal dense units, we need to process more than 1 dimension from a dimension set. Secondly, the partial maximal dense units need to be combined with all of the other dense units using collisions to find the complete set of maximal dense units. Both of these arguments make this approach highly inefficient. An alternative way is to let dense units from all single dimensions collide in the hash table but with a control over the number of dense units being generated and stored. We 66 Chapter 4. Scalable subspace clustering know that each dense unit is identified with its signature, which is an integer value. If a dense unit matches with another dense unit, the signatures of both will come from the same range of integer values. We can split the combinatorial dense unit computation into slots where each slot has an integer range for allowed signature values. The size of a slot can be adapted according to the available memory and thus, hashing of signatures can fit in the given hash table. We discuss this approach in detail in the following subsection. 4.3.1 Splitting hash computations The performance of our algorithm depends heavily on the number of dense units being generated in each dimension. The number of dense units in each dimension is derived from the size of the data, chosen value of and the underlying data distribution. A larger increases the range of the neighbourhood of a point and is likely to produce bigger core-sets. As we do not have prior information about the underlying data distribution, a single dimension can have a large number of dense units. Thus, for a scalable solution to subspace clustering through the SUBSCALE algorithm, the system must be able to handle a large number of collisions of the dense units. In order to identify collisions among dense units across multiple dimensions, we need a collision space (hT able) big enough to hold these dense units in the working memory of system. But with the limited memory availability, this is not always possible. If numj is the number of total dense units in a dimension j, then a k-dimensional dataset may Pk j have N U M = j=1 num dense units in total. The signature technique used in the SUBSCALE algorithm has a huge benefit that we can split N U M to the granularity of k. As each dense unit contains τ + 1 points and we are assigning large integers to these points from the key database K to generate signatures H, the value of any signature thus generated would approximately lie between the range R = (τ + 1) × min(K), (τ + 1) × max(K) , where min(K) and max(K) are respectively the smallest and the largest keys being used. 67 Chapter 4. Scalable subspace clustering Figure 4.5: Illustration of splitting hT able computations. 
For τ = 3, Split factor sp = 3, minimum large-key value min(K) = 1088 and maximum large-key value max(K) = 9912, approximate range of expected signature sums is (1088 × (τ + 1), 9912 × (τ + 1)). Each signature sum is derived from a dense unit of fixed size = τ + 1. The detection of maximal dense units involves matching of the same signature across multiple dimensions using a hash table. Thus, the hash table should be able to retain all signatures from a range R. We can split this range into multiple chunks so that each chunk can be processed independently using a much smaller hash table. If sp is the split factor, we can divide the range R into sp parts and thus, into sp hash tables where each hT able holds at the most R sp entries for these dense units. But since the keys are being generated randomly from a fixed range of digits, the actual entries will be very less. In a 14-digit integer space, we have 9 × 1013 keys to choose from (9 ways to choose the most significant digit and 10 ways each for the rest of the 13 digits). The number of actual keys being used will be equal to the number of points in the dataset |DB|. 68 Chapter 4. Scalable subspace clustering In Figure 4.5, we illustrate the splitting of hT able computations with a range of 4-digit integers from 1000 to 9999. Let |DB| = 500 and so we need 500 random keys from 9000 available integer keys. If τ = 3 then, some of these 500 keys will form dense core-sets. Let us assume that 1/5th of these keys are in a core-set in some 1-dimensional space, then 100 we would need a hash table big enough to store ≈ 4 million entries, which is . If 4 we choose the split factor of 3 then we have 3 hash tables where each hash table can store approx 1 million of entries. Typically, Java uses 8 bytes for a long key, so 32 bytes for a signature with τ = 3 and additional bytes to store the colliding dimensions (say ≈ 40 bytes per entry for an average subspace dimensionality of 10), are required. In the next section, we explain the scalable version of the SUBSCALE algorithm followed by its experimental evaluation and analysis of the results. 4.4 Scalable SUBSCALE algorithm The SUBSCALE algorithm from Chapter 3 was redesigned to accommodate scalability with bigger datasets. The pseudo code for the scalable SUBSCALE algorithm is given in Algorithm 7 (scalableSUBSCALE) below. Instead of finding core-sets in each dimension and simultaneously hashing them into the hash table hT able as in previous chapter, the core-sets in all single dimensions are precomputed (step 1 of Algorithm 7) using Algorithm 8 (findCoreSets). The split factor sp is supplied by the user along with and τ values. Each of the sp slices generates candidate signatures using Algorithm 9 (findSignaturesInRange), between LOW and HIGH values computed in steps 2-4 of Algorithm 7. Each entry Sigx of hT able is a signature node: {sum, U, subspace} corresponding to a dense unit U. It is expected that a large number of subspaces containing maximal dense units will be detected for bigger datasets. To avoid memory overflow, we store these maximal dense units in the relevant file storage (steps 19-21 of Algorithm 7). The relevant file is named as the size of the subspace in which a maximal dense unit is detected. Thus, all 69 Chapter 4. Scalable subspace clustering 2-dimensional maximal dense units will be stored in a file named ‘2.dat’, 3-dimensional maximal dense units will be stored in a file named ‘3.dat’ and so on. 
These files can be processed later using a scripting language to run DBSCAN or similar cluster generation algorithm on the already detected dense points in these files. The Algorithm 10 (denseUnitsInRange) is a modified version of Algorithm 6 (getDenseUnitsPivot) from the last chapter. The main difference is an additional check to maintain signature sum values between LOW and HIGH . Also, instead of generating all dense units from a core-set and then filtering them for those with signature sums between a given LOW and HIGH range, the algorithm is optimised (steps 22-32 of Algorithm 10) with a condition check such that, the core-set processing will stop when all of the next dense unit generated from the core-set are expected to have signature sums more than or equal to HIGH. In Algorithm 10, seed, tempseed and U are initialized as empty arrays of size r each and sums is initialized as an empty array of size r + 1. 4.5 Experiments and analysis We implemented the SUBSCALE algorithm in Java language on an Intel Core i7-2600 desktop with 64-bit Windows 7 OS and 16GB RAM. The pedestrian data set is extracted from the attributed pedestrian database [77, 78] using the Matlab code given in APiS1.0 [79]. The madelon data set available at UCI repository [69]. Both of the datasets were normalised between 0 and 1 using WEKA [68] and contained no missing values. The source code of scalable version of the SUBSCALE algorithm is available at [80]. We ran the SUBSCALE algorithm on madelon dataset of 4400 points and 500 dimensions using different values of sp ranging from 1 to 500 and Figure 4.6 shows its effect on the runtime performance for two different values of . The darker line is for larger value of and hence, higher runtime. But for both of the values, the execution time is almost proportional to the split factor after an initial threshold. 70 1 Chapter 4. Scalable subspace clustering Input: DB of n × k points; Set K of n unique, random and large integers; ; τ ; Split factor sp Output: Clusters: Set of maximal subspace clusters CS 1 , CS 2 , . . . , CS k ← F indCoreSets(DB) /* Get core-sets in all k dimensions. 2 3 /* max(K) and min(K) are the maximum and minimum values of keys available in the key database K. 4 5 6 8 9 10 12 13 15 16 17 18 19 20 21 */ for each candidate Sigx ∈ CandidateN odes do if there exists a signature node, Sigy in hT able, such that Sigx .sum = Sigy .sum then Sigy .subspace ← Sigy .subspace ∪ Sigx .subspace /* ∪ is a union set-operator 14 */ LOW ← HIGH HIGH ← LOW + SLICE for j ← 1 to k do CandidateN odes ← f indSignaturesInRange(CS j , LOW, HIGH) /* Get candidate signature nodes from core-sets in CS j . 11 */ HIGH ← LOW for split ← 1 to sp do Hashtable hT able ← {} /* Initialise an empty hash table. 7 */ SLICE ← ((max(K) − min(K)) ∗ (τ + 1))/sp LOW ← min(K) ∗ (τ + 1) */ else Add Sigx to hT able end end end for all entries {Sigx , Sigy , . . . } ∈ hT able do if Sigx .subspace = Sigy .subspace = . . . then Add entry {Sigx .U ∪ Sigy .U ∪ . . . } to |subspace|.datf ile /* ∪ is a union set-operator. |subspace| is the number of dimensions in the subspace. Each of the d.dat file contain maximal dense units in the relevant d-dimensional subspaces. */ end 23 end 24 end 25 Run any full dimensional clustering algorithm, for example, DBSCAN on each entry of the d.dat file to output maximal subspace Clusters Algorithm 7: scalableSUBSCALE: Scalable version of the SUBSCALE algorithm 22 71 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 Chapter 4. 
Scalable subspace clustering Input: DB of n × k points, τ and Output: A collection of core-sets CS 1 , CS 2 , . . . , CS k in all k dimensions for j ← 1 to k do Sort points P1 , P2 , . . . , Pn , s.t. ∀Px , Py ∈ DB, Pxj ≤ Pyj last ← −1 x←1 for i ← 1 to n − 1 do tempSet ← Pi numN eighbours ← 1 next ← i + 1 j while next ≤ n and Pnext − Pij < do Append point Pnext to tempSet Increment numN eighbours Increment next end newLast ← lastElement in tempSet if newLast! = last then last ← newLast if numN eighbours ≥ τ then CSxj = tempSet /* CSxj is a core-set of dense points. 19 20 21 22 23 Increment x end end end end Algorithm 8: FindCoreSets: Find core sets in the given dataset. */ 72 Chapter 4. Scalable subspace clustering Input: Core-sets CS j from j th dimension, LOW , HIGH /* For readability, we drop the dimension index j from CS 1 2 */ Output: Set of candidate signature nodes: CandidateN odes lastElement ← −1 for i ← 1 to |CS| do /* |CS| is the number of core-sets */ pivot ← indexOf (lastElement) in CSi 4 if pivot ≤ τ then 5 DenseU nits ← denseU nitsInRange(|CSi |, τ + 1) 6 else 7 Split CSi into CSi1 and CSi2 such that CSi1 contains first 1 . . . p points and CSi2 contains the rest of the points 8 if |CSi2 | > τ then 9 DenseU nits ← denseU nitsInRange(|CSi2 |, τ + 1) 10 select ← τ 11 else 12 select ← |CSi2 | 13 end 14 count ← 1 15 do 16 partial1 ← partialDenseU nitsInRange(|CSi1 |, τ + 1 − count) 17 partial2 ← partialDenseU nitsInRange(|CSi2 |, count) 18 for p ← 1 to |partial1 | do 19 for q ← 1 to |partial2 | do 20 if LOW ≤ f indSum(partial1 [p]) + f indSum(partial2 [q]) < HIGH then 21 Merge both dense units partial1 [p] and partial2 [q] and add to set DenseU nits 22 end 23 end 24 end 25 Increment count 26 while count ≤ select 27 end 28 end 29 for each dense unit U ∈ DenseU nits do 30 sum ← f indSum(U, K) 31 subspace ← j 32 Add signature {sum, U, subspace} to CandidateN odes 33 end Algorithm 9: findSignaturesInRange: Find candidate signature nodes in a given coreset and with signature sum between LOW and HIGH. 3 73 Chapter 4. Scalable subspace clustering Input: CS, r, K, and is its complete dense unit of size τ + 1. Output: DenseU nits: A set of dense units, each of size r. 1 for i ← 1 to c do 2 localKeys[i] ← M (CS[i] 7→ K) 3 end 4 Sort localKeys in ascending order 5 for i ← 2 to r do 6 seed[i] ← c − r + i 7 end 8 seed[1] ← 0 9 while true do 10 i←r 11 while i > 0 and seed[i] = c − r + i do 12 Decrement i /* Get the active position. */ 13 end 14 if i = 0 then 15 break /* All combinations have been generated. */ 16 else 17 temp ← seed[i] /* Get seed element. */ 18 for j ← i to r do 19 k ← temp + 1 + j − i 20 tempseed[j] ← k 21 tempsum ← sums[j] + localKey[temp] 22 if tempsum ≥ HIGH then 23 f lag ← true /* Skip the rest of the computations */ 24 while (j > 2) and ((tempseed[j] − tempseed[j − 1]) < 2) do 25 Decrement j 26 end 27 while j ≤ r do 28 tempseed[j] ← c − r + j 29 Increment j /* Reset the seed */ 30 end 31 break 32 end 33 sums[j + 1] ← tempsum 34 U[j] ← M (K[localKey[temp]] 7→ DB) 35 end 36 seed ← tempseed 37 f lag ← f alse /* Go to the next iteration */ 38 if f indSum(U) ≥ LOW then 39 Copy dense unit U to the output set DenseU nits 40 end 41 end 42 end Algorithm 10: denseUnitsInRange: Find all combination dense units of size r from a core-set CS such that the signature sum of each dense unit is less than HIGH but greater than or equal to LOW . 74 Chapter 4. Scalable subspace clustering Input: CS, r, K, and is its complete dense unit of size τ + 1. 
Output: DenseU nits: A set of dense units, each of size r. 1 This algorithm is similar to Algorithm 10 (denseUnitsInRange) except the last step. No check is made for f indSum(U) ≥ LOW . 2 The dense unit U is simply copied to the output set DenseU nits. Algorithm 11: partialDenseUnitsInRange: Find all combination dense units of size less than r from a core-set CS such that the signature sum of each dense unit is less than HIGH. 600,000 Runtime (ms) 500,000 Epsilon=2.0E−6 Epsilon=5.0E−6 400,000 300,000 200,000 100,000 0 0 100 200 300 Split Factor 400 500 Figure 4.6: Runtime vs split factor for madelon dataset. The execution time is almost proportional to the split factor after an initial threshold. 75 Chapter 4. Scalable subspace clustering The size of hT ablei can be adjusted to handle large databases of points by choosing an appropriate split factor sp. Instead of generating all dense units in all dimensions sp times, if we sort the keys of density-connected points, we can stop the computation of dense units when no more signature sums less than the upperlimit of hT ablei are possible. We successfully ran this modified SUBSCALE algorithm for 3661 × 6144 pedestrian dataset. We used = 0.000001, τ = 3, minSize = 4, sp = 4000 and it took 336 hrs to finish and compute dense units in 350 million subspaces. We encountered memory overflow problems when handling a large number of subspace clusters and that was due to the increasing size of Clusters data structure used in Algorithm 2. We found a solution by distributing the dense points of each identified maximal subspace from the hash table to a relevant file on the secondary storage. The relevance is determined by the cardinality of the found subspace. If a set of points is found in a maximal subspace of cardinality m, we can store these dense points in a file named ‘m.dat’ along with their relevant subspaces. It will also facilitate running any fulldimensional algorithm like DBSCAN on each of these files as its parameters can be set according to different dimensionality. 4.6 Summary The generation of large and high-dimensional data in the recent few years has overwhelmed the data mining community. In this chapter, we have presented a scalable version of the SUBSCALE algorithm proposed in previous chapter. The scalable version has performed far better when it comes to handling high-dimensional datasets. We have experimented with up to 6144 dimensional data and we can safely claim that it will work for larger datasets too by adjusting the splitf actor. However, the main cost in the scalable SUBSCALE algorithm is the computation of the candidate 1-dimensional dense units. In addition to splitting the hash table compu- 76 Chapter 4. Scalable subspace clustering tation, the SUBSCALE has a high degree of parallelism as there is no dependency in computing dense units across multiple dimensions. We exploit the parallelism in the algorithm structure of the SUBSCALE algorithm in the next chapter using OpenMP based shared memory architecture. Chapter 5 Parallelization 5.1 Introduction The growing size and dimensions of the data these days have set new challenges for the data mining research community [5]. Clustering is a data mining process of grouping similar data points into clusters without any prior knowledge of the underlying data distribution [46]. As discussed in chapter 2, the traditional clustering algorithms either attempt to partition a given data set into predefined number of clusters or use full-dimensional space to cluster the data [47]. 
However, these techniques are unable to find all hidden clusters especially in high-dimensional data. The increase in the number of dimensions of data, impede the performance of these clustering algorithms which are known to perform very well with low dimensions. As discussed in the previous chapters, data group together differently under the different subsets of dimensions, called subspaces. A set of points can form a cluster in a particular subspace and can be part of different clusters or may not participate in any cluster in other subspaces. Thus, it becomes imperative to find all hidden clusters in these subspaces. The subspace clustering algorithms forms a branch of clustering algorithms which attempt to find clusters in all possible subset of dimensions of a given data set [16, 49]. 77 78 Chapter 5. Parallelization Usually distance or density among the points is used to measure the similarity. Given an n × k dataset, a data point is a k-dimensional vector with values measured against each of the k dimensions. Two data points are said to be similar in a given subset of dimensions (subspace) if the values of these points under each dimension participating in this subspace are similar as per the similarity criteria. Since a k-dimensional data set can have upto 2k − 1 possible axis-parallel subspaces, therefore, the search space for subspace clustering becomes exponential in dimensions. Subspace clustering is computationally very expensive process. Most of the relevant algorithms are inefficient as well as ineffective for high-dimensional data sets. With the wider availability of multi-core processors these days, parallelization seems to be an obvious choice to reduce this computational cost. There has been some work in the literature for parallel algorithms in subspace clustering. But earlier subspace clustering algorithms have less obvious parallel structures. This is partially due to the data dependence during the processing sequence. In chapter 3, we proposed SUBSCALE algorithm which is a promising approach to find the non-trivial subspace clusters without enumerating the data points [81]. This algorithm requires only k database scans for a k-dimensional data. In Chapter 4, we have proposed the scalable version of the SUBSCALE algorithm. The widespread availability of multi-core processors have fuelled our endeavour to parallelize the SUBSCALE algorithm and further reduce its time complexity. In this chapter, we present the modifications in SUBSCALE algorithm to utilise the multiple threads through OpenMP framework [82]. The SUBSCALE algorithm first generates the dense set of points across all 1-dimensional subspaces and then efficiently combine them to find the non-trivial subspace clusters. The non-trivial subspace clusters are also called maximal clusters. If a set of data points forms a cluster C in a particular subspace of d dimensions where k are the total dimensions and d ≤ k, then this cluster will exist in all of the 2d subsets of this subspace [14]. Although SUBSCALE algorithm does not generate any trivial subspace clusters, its time complexity is still compute intensive due to the generation of the combinatorial 1-dimensional dense 79 Chapter 5. Parallelization set of points. However, the compute time can be reduced by parallelizing the computation of the dense units. In this chapter, we focus on scalable SUBSCALE subspace clustering algorithm due to the computational independence in the structure of this algorithm. 
We aim to utilise the multi-core architecture to accelerate the SUBSCALE algorithm while providing the same output as a sequential version. We investigate the runtime performance with upto 48 cores running in parallel. The experimental evaluation demonstrates the speedup of upto the factor of 27. Our modified algorithm is faster and scalable for high-dimensional large data sets. In the next section we discuss some of the related literature. The section 5.3 gives the background of SUBSCALE algorithm and our approach. In section 5.4, we analyse the performance of parallel implementation and finally, the chapter is summarized in section 5.5. 5.2 Related work Over the past few years, there has been extensive research in the clustering algorithms [5, 36, 46]. One of the famous techniques to deal with high dimensionality is to reduce the number of dimensions by removing the irrelevant (or less relevant) dimensions, for example, Principal Component Analysis (PCA) transforms the original high-dimensional space into low-dimensional space [83]. Since PCA preserves the original variance of the fulldimensional data during this transformation, therefore, if no cluster structure was detected in the original dimensions, no new clusters in the transformed dimensions will be found. Also, the transformed dimensions lack the intuitive meaning as it is difficult to interpret the clusters found in the new dimensions in relation to the original data space. The significance of local relevance among the data with respect to the subset of dimensions, has lead to advent of subspace clustering algorithms [16, 49]. 80 Chapter 5. Parallelization The projected clustering algorithms like PROCLUS [13] and FINDIT [56] require users to input the number of clusters and the number of subspaces, which is difficult to estimate for the real data sets. Hence, these algorithms are essentially data partitioning techniques and cannot discover the hidden subspace clusters in the data. The algorithms based on full-dimensional space like DBSCAN [10] are also ineffective for high-dimensional data sets due to the curse of dimensionality [81]. According to the DBSCAN algorithm, a point is dense if it has τ or more points in its -neighbourhood and a cluster is defined as a set of such dense (similar) points. Two point are said to be in the same neighbourhood (similar) if the values under each of the corresponding dimension lies within distance. As mentioned in the previous section, a data point is a vector in a k-dimensional space. But, for high-dimensional data, the clusters exist in the subspaces of the data as two points might be similar in a certain subset of dimensions but may be totally unrelated (or distant) in the other subset of dimensions. The underlying premise that data group together differently under different subsets of dimensions opened the challenging domain of the subspace clustering algorithms [16, 49, 50]. Agrawal et al. [15] were the first to introduce the grid-density based subspace clustering approach in their famous CLIQUE algorithm. The data space is partitioned into equal-sized 1-dimensional ξ units using a fixed size grid. A unit is considered dense if the number of points in it exceeds the density support threshold, τ . A subspace with k1 dimensions participating in it is called higher-dimensional than another subspace with k2 dimensions in it if k1 > k2 . 
The lower-dimensional candidate dense units are combined together iteratively for computing higher-dimensional dense units (clusters), starting from the 1-dimensional units. There are many other variations of this algorithm, e.g., using entropy [61] and adaptive grid [60]. Instead of using the grid, SUBCLU algorithm applied DBSCAN one each of the candidate subspaces [58] where DBSCAN is an full-dimensional clustering algorithm. The INSCY algorithm [59] is an extension of SUBCLU which uses indexing to compute and merge 1-dimensional base clusters to find the non-trivial subspace clusters. 81 Chapter 5. Parallelization The underlying premise that data group together differently under different subsets of dimensions opened the challenging domain of subspace clustering algorithms [16, 50]. Although all of these subspace clustering algorithms can detect previously unknown subspace clusters, they fail for high-dimensional data sets. The inefficiency arises due to the detection of redundant trivial clusters and an excessive number of database scans during the clustering process. The subspace clustering is a compute intensive task and parallelization seems to be an obvious choice to reduce this computational cost. But, most of the subspace clustering algorithms have less obvious parallel structures [15, 58]. This is partially due to the data dependency during the processing sequence [84]. The SUBSCALE algorithm introduced in the previous chapters requires only k database scans to process a k-dimensional dataset. Also, this algorithm is scalable with the dimensions and does not compute the trivial clusters as compared to the existing algorithms. Even though this algorithm does not generate any trivial subspace clusters, its time complexity is still compute intensive due to the generation of the combinatorial 1-dimensional dense set of points. The compute time can be reduced by computing these 1-dimensional dense units in parallel. In the next section, we briefly discuss the SUBSCALE algorithm and our modifications for parallel implementation. 5.3 Parallel subspace clustering The increasing availability of multi-core processor these days further the expectations for efficient clustering of high-dimensional data. Parallel Processing of data can help to speed-up the execution time by sharing the processing load amongst multiple threads running on multiple processors or cores. However, the sequential process should be decomposed into independent units of execution so as to be distributed among multiple threads running on separate cores or processors. Also, management of threads from their gen- 82 Chapter 5. Parallelization eration to termination including inter-communication and synchronisation makes parallel processing a complex task. In this section, we discuss the parallel implementation of subspace clustering using multi-core architectures in detail. Our extend the work done in previous chapters on the SUBSCALE algorithm. We aim to further reduce the execution time by parallelizing the compute intensive part of the SUBSCALE algorithm. Before presenting our approach, some of the basic definitions and concepts are as below: Definitions Let DB be a database of n × k points where DB : {P1 , P2 , . . . , Pn }. Each point Pi is a k-dimensional vector {Pi1 , Pi2 , . . . , Pik } such that, Pid is the projection of a point Pi in the dth dimension. A point refers to the data point from the dataset. A subspace is the subset of the dimensions. 
For example, S : {r, s} is a 2-dimensional subspace consisting of rth and sth dimension and the projection of a point Pi in this subspace is {Pir , Pis }. The dimensionality of a subspace refers to the total number of dimensions in it. A single dimension can be referred as a 1-dimensional subspace. A subspace with dimensionality a is a higher-dimensional subspace than another subspace with dimensionality b, if a > b. Also, a subspace S 0 with dimensionality b is a projection of another subspace S of dimensionality a, if a > b and S 0 ⊂ S, that is, all the dimensions participating in S 0 are also contained in the subspace S. A subspace cluster Ci = (P, S) is the set of points P , such that the projections of these points in subspace S, are dense. A cluster Ci = (P, S) is called a maximal subspace cluster, if there is no other cluster Cj = (P, S 0 ) such that S 0 ⊃ S. According to the Apriori principle [14], it is sufficient to find only the maximal subspace clusters rather than all clusters in all possible subspaces. The reason behind this sufficiency is that a dense set of points in higher-dimensional subspace is dense in all of its lower-dimensional 83 Chapter 5. Parallelization projections. The lower-dimensional projections of a maximal cluster contains redundant information. Next, we give an overview of the SUBSCALE algorithm and also, highlight the research problem. 5.3.1 SUBSCALE algorithm The SUBSCALE is a clustering algorithm to find maximal subspace clusters without generating the trivial lower-dimensional clusters. The projections of the dense points in the maximal subspace cluster, will be dense in all single dimensions participating in this subspace. The main idea behind the SUBSCALE algorithm is to find the dense sets of points (density chunks) in all of the k single dimensions, generate the relevant signatures from these density chunks, and collide them in a hash table (hT able) to directly compute the maximal subspace clusters. The Algorithm 12 explains these processing steps. We now briefly describe the process of finding density chunks and the corresponding signatures. 1 2 3 4 5 6 7 8 Input: DB : n × k data, a set of n keys:K Output: Dense points in maximal subspaces Initialize a common hash table hT able. for dimension j ← 1 to k do Scan {P1j , P2j , . . . , Pnj } and find density chunks for each density chunk do create signatures and hash them to hT able end end Collect all collisions from hT able to output dense points in maximal subspaces Algorithm 12: SUBSCALE algorithm in brief Density chunks The SUBSCALE algorithm uses distance (L1 metric) based similarity measure to define the density among the data points. Based on two user defined parameters and τ , a data point is dense if it has atleast τ points within distance. The neighbourhood N (Pi ) of a point Pi in a particular dimension d is a set of all the points Pj such that L1 (Pid , Pjd ) < , 84 Chapter 5. Parallelization P1 P2 P3 P4 Dimension d2 P5 P7 P9 P11 P12 P13 P14 Dimension d1 Figure 5.1: Figure adapted from [81] illustrates the lack of information about which 1dimensional clusters (dense units) will generate the maximal clusters. Pi 6= Pj . Each dense point along with its neighbours, forms a density chunk such that each member of this chunk is within distance from each other. The smallest possible dense set of points is of size τ + 1, known as a dense unit. In a t particular dimension, a density chunk of size t will generate τ +1 possible combinations of points to form the dense units. 
Some of these dense units may or may not contain projections of higher-dimensional maximal subspace clusters. As we do not have prior information of the underlying data distribution, it is not possible to know in advance that which of these dense units are significant. Only possibility is, to check which of these dense units from different dimensions contain identical points. As shown in Figure 5.1, the projections of the points {P7 , P8 , P9 , P10 } in dimension d2 form a 1-dimensional cluster (Cluster3 ), but there is no 1-dimensional cluster in dimension d1 with the identical points as Cluster3 , thus, the absence of a cluster in the subspace {d1 , d2 } with the points {P7 , P8 , P9 , P10 }. Signatures To create signatures from the dense units, n random and unique keys made of integers with large digits are chosen to create a pool of keys called K. Each of the n data points is 85 Chapter 5. Parallelization = Figure 5.2: Each signature node, corresponds to a dense unit and consists of the sum of the keys in this dense unit, the data points contained in the dense unit and the dimensions in which this dense unit exists. mapped to a key from the keys database K on 1:1 basis. The sum of the mapped keys of the data points in each dense unit is termed as its signature. According to the observation 2 and 3 in the chapter 3, two dense units with equal signatures would have identical points in them. Thus, collisions of the signatures across dimensions dr , . . . , ds implies that, the corresponding dense unit exists in the maximal subspace, S = {dr , . . . , ds }. We refer our readers to the chapter 3 for the detailed explanation and the proof of this concept. Each single dimension may have zero or more dense chunks, which in turn generate different number of signatures in each dimension. Some of these signatures will collide with the signatures from the other dimensions to give a set of dense points in the maximal subspace. Hashing of signatures The SUBSCALE algorithm uses an hT able data structure similar to the hash table to compute the dense units in the maximal subspaces. The hT able is a simple storage mechanism to store the information regarding the signatures of the dense units generated across single dimensions. A signature node, SigN ode is used to store the information pertaining to each dense unit (Figure5.2). Each SigN ode contains: sum of the keys of the corresponding dense unit, the data points in the dense unit, the dimensions in which this dense unit was computed and the pointer to the next signature node, if any. Figure 5.3 shows the hash table used in this chapter. The hT able consists of a fixed number of slots (numSlots) and each slot can store one or more signature nodes. In this chapter, we used modulo function to assign a slot to a SigN ode. Two or more signature nodes with different sums may be allotted the same slot in the hT able, if the modulo output is same for these nodes. The linked list can be used to store more than 86 Chapter 5. Parallelization Figure 5.3: hT able data structure in SUBSCALE to store signatures and their associated data from multiple dimensions. numSlots is the number of total slots available in the hT able. Each slot may have 0 or more signature nodes stored in it. one SigN ode at a slot. An hT able is thus, a collection of signature nodes. Two dense units are said to be colliding if they have equal signatures, which is the sum value in its signature node (Figure 5.2). 
When two dense units collide, an additional dimension is appended in the SigN ode. Memory and runtime cost The value of the numSlots depends on the number of dense units being generated from each density chunk, which in turn depends on the number of density chunks in each dimension. We do not have any prior information about the underlying data distribution or density. Even though the total number of dense units can be calculated by first creating the density chunks in all of the single dimensions through the formula given in 8 above, we do not know in advance that which of these dense units will collide and which would not collide. When two signature nodes collide, we do not store the second node again, just an additional dimension is appended to the first node. To find the maximal subspace clusters, all possible dense units in all of the single dimensions are required to be generated. The storage requirements for the total signatures generated from a data set can outgrow the available memory in the system to store 87 Chapter 5. Parallelization hT able. It adds to the cost of time and memory requirements for the SUBSCALE algorithm. The sequential version of the SUBSCALE algorithm proposed splitting of hash table computations to overcome this memory constraint. The size of each dense unit is τ + 1. If K is the key database of n large integers, then value of a signature generated from a dense unit would lie approximately between the range R = (τ + 1) × min(K), (τ + 1) × max(K) , where min(K) and max(K) are the smallest and the largest keys respectively. Also, if numSig d is the number of total signatures in a dimension d, then the total number of signatures in a k dimensional data set will P be totalSignatures = kd=1 numSig d . If memory is not a constraint then a hash table with R slots can easily accommodate total signatures as typically, totalSignatures R. Since memory is a constraint, the SUBSCALE algorithm splits this range R into multiple slices such that each slice can be processed independently using a separate and much smaller hash table. The computations for each slice is not dependent of other slices. The split factor called sp determines the number of splits of R and its value can be set according to the available working memory. Thus, the computations of dense units in each single dimension as well as each single slice can be processed independent of others. In the next section, we endeavour to use these independence among dense units to reduce the execution time for the SUBSCALE algorithm with multiple cores. 5.3.2 Parallelization using OpenMP In the previous subsection, we briefly investigated the internal working of the SUBSCALE algorithm and identified few areas which can be processed independently. Next, we discuss how OpenMP threads on multi-core architectures can be used to exploit the parallelism in the SUBSCALE algorithm. 88 Chapter 5. Parallelization OpenMP The rapid increase in the multi-core processor architectures these days, have stretched the boundaries of computing performance. Using multi-threaded OpenMP platform, we can leverage these multiple cores for parallel processing of data and instructions. OpenMP is a set of complier directives and callable runtime library routines to facilitate shared-memory parallelism [82]. The #pragma omp parallel directive is used to tell the compiler that the block of code should be executed by multiple threads in parallel. 
We used OpenMP with C and re-implemented the SUBSCALE algorithm to parallelize the code using OpenMP directives. Dimensions in parallel The generation of signatures from the density chunks in each single dimension is independent of other dimensions. This observation makes the Step 2 of Algorithm 12, a strong candidate for parallelization. As shown in Figure 5.4, we can divide the dimensions among the threads such that each thread will process its share of 1-dimensional data points to compute the signatures. Each thread runs on a separate processing core and dimensions can be distributed equally or unequally among the threads. If t is number of threads being used, then t out of total k dimensions can be processed in parallel, assuming t < k. A hash table hT able is shared among the threads to store the signatures as soon as they are generated (Figure 5.3). The information about the collisions among signatures from the different dimensions is also stored in this hash table. The Algorithm 12 can be modified to process the dimensions in parallel as in Algorithm 13. The heuristics can be used to fix the number of slots in the hT able. However, the problem with this set up is that every time a thread accesses the shared hT able to hash a signature, it would have to take exclusive control of the required memory slot. Without the mutual exclusive access, two threads with the same signatures generated from two different dimensions, would overwrite the same slot of hT able. The overwriting 89 Chapter 5. Parallelization --2 --- 3 --- --1 Thread t Thread i Thread 1 --- k-1 k Dimensions Figure 5.4: Parallel processing of SUBSCALE algorithm. Each dimension is allocated a separate thread and each thread compute the density chunks and its signatures independent of other threads would lead to lose of information on the maximal subspace related to this signature. The maximal subspace of a dense unit can only be found by having the information about which dimensions generated this dense unit. The OpenMP provides a lock mechanism for the shared variables but its synchronisation adds to the overhead as well as thread contention. When dimensions are being processed in parallel, a large number of combinatorial signatures will be generated. The number of signatures being mapped to same slot will depend on their sum value, numSlots in the hT able and hashing function (modulo in this chapter) being used. The smaller hash table would lead to frequent requests for exclusive access to the same slot from different threads. It can be argued that a large hash table would result in decrease in lock contention but then, the number of locks to be maintained would grow proportional to the slots in the hash table. Also, the allowed total size of a hash table depends on the available working memory. We discuss the results from this method in the section 5.4. Slices in parallel The sharing of a common hash table among threads is a bottleneck for the speed up expected through parallel processing of dimensions. The signatures generated from all of the dimensions need to be processed in order to identify which of these collide. Without a shared data structure to hold these signatures, we would not know which other dimensions 90 Chapter 5. 
Parallelization Input: DB : n × k data, a set of n keys:K Output: Dense points in maximal subspaces 1 Initialize a common hash table hT able 2 numT hreads ← k 3 #pragma omp parallel num threads(numThreads) shared(DB, hT able, K) 4 { 5 #pragma omp for 6 for dimension j ← 1 to k do 7 Scan {P1j , P2j , . . . , Pnj } and find density chunks 8 for each density chunk do 9 create signatures and hash them to hT able in a mutually exclusive way 10 end 11 end 12 } Collect all collisions from hT able to output dense points in maximal subspaces Algorithm 13: Modified SUBSCALE algorithm to execute multiple dimensions on multiple cores and with a shared hash table Input: DB : n × k data, a set of n keys:K, SP Output: Dense points in maximal subspaces 1 numT hreads ← k 2 R ← (max(K) − min(K)) × (τ + 1) R 3 SLICE = SP 4 #pragma omp parallel num threads(numThreads) shared(DB, K) private(LOW ,HIGH) 5 { 6 #pragma omp for 7 for dimension split ← 0 to SP − 1 do 8 Initialize a new hash table hT able 9 LOW = min(K) × (τ + 1) + split ∗ SLICE 10 HIGH = LOW + SLICE 11 for dimension j ← 1 to k do 12 Scan {P1j , P2j , . . . , Pnj } and find density chunks 13 for each density chunk do 14 create signatures between LOW and HIGH and hash them to hT able 15 end 16 end 17 Collect all collisions from hT able to output dense points in maximal subspaces 18 Discard hT able 19 end 20 } Algorithm 14: Modified SUBSCALE algorithm to execute multiple slices on multiple cores and separate hT able. 91 Chapter 5. Parallelization are generating the same signature. Although we cannot split the hash table, we can split the generation of signatures so that, only sums within a certain range are allowed in the hash table at a time. As discussed in 8 above, the SUBSCALE algorithm proposed splitting the range R of expected signature values among multiples slices. Since these slices can be processed independent of each other, multiple threads can process them in parallel as in Algorithm 14. Each slice requires a separate hash table. Though this approach helps to achieve faster clustering performance from the SUBSCALE algorithm but the memory required to store all of the hash tables can still be a constraint. Since R denotes the whole range of computation sums that are expected during the signature generation process, we can bring these slices into the main working memory one by one. Each slice is again split into sub-slices to be processed with multiple threads as explained in Algorithm 15. The results and their evaluation are discussed in the section. 5.4 5.4.1 Results and Analysis Experimental setup The experiments were carried out on the IBM Softlayer Server with 48 cores, 128 GB RAM and Ubuntu 15.04 kernel. The hyper-threading was disabled on the server so that each thread could run on a separate physical core and the parallel performance could be measured fairly. The parallel version of the SUBSCALE algorithm was implemented in C using OpenMP directives. The for loop directive #pragma omp parallel was used to allocate work to the multiple cores. Also, we used 14-digit non-negative integers for the key database. 92 Chapter 5. 
Parallelization Input: DB : n × k data, a set of n keys:K,SP ,innerSP Output: Dense points in maximal subspaces 1 numT hreads ← k 2 R ← ((max(K) − min(K)) × (τ + 1) R 3 SLICE = sp 4 for split ← 0 to SP − 1 do 5 LOW = min(K) × (τ + 1) + split × SLICE SLICE 6 innerSLICE = innerSP 7 #pragma omp parallel num threads(numThreads) shared(DB, K) firstprivate(LOW ) 8 { 9 #pragma omp for 10 for dimension split ← 0 to SP − 1 do 11 Initialize a new hash table hT able 12 innerLOW = LOW + innersplit × innerSLICE 13 innerHIGH = innerLOW + innerSLICE 14 for dimension j ← 1 to k do 15 Scan {P1j , P2j , . . . , Pnj } and find density chunks 16 for each density chunk do 17 create signatures between innerLOW and innerHIGH and hash them to hT able 18 end 19 end 20 Collect all collisions from hT able to output dense points in maximal subspaces 21 Discard hT able 22 end 23 } 24 end Algorithm 15: Modified SUBSCALE algorithm to execute multiple subslices on multiple cores 93 Chapter 5. Parallelization 5.4.2 Data Sets The synthetic datasets may contain inherent bias for the underlying data distribution, therefore, we used real data sets for our clustering experiments. The two main datasets for this experiment: 4400 × 500 madelon dataset and 3661 × 6144 pedestrian dataset, are publicly available. The madelon data is available at the UCI repository [69] and the pedestrian dataset was created through the attributed pedestrian database [77, 78] using the Matlab code in given in APiS1.0 [79]. 5.4.3 Speedup with multiple cores We compared the runtime performance of modified SUBSCALE algorithm using multiple threads running in parallel on upto 48 cores. The first attempt was to compute the dense units in all single dimensions in parallel. Multiple cores for dimensions We used 500 dimensional madelon data set with = 0.000001, τ = 3 and with these parameters, the total signatures from all of the single dimensions in madelon dataset were calculated at 29350693. The total number of signatures can be pre-calculated from the dense chunks in all dimensions. Some of these signatures will collide in the common hash table hT able, shared among the threads. As discussed in the previous subsection, the shared hT able will eventually lead to memory contention whenever multiple threads try to access the same slot of hT able simultaneously. Since the frequency of this contention depends upon the number of slots in the hT able, thus, we experimented with three different number of slots in the shared hT able: 0.1 million, 0.5 million and 1 million. The Figure 5.5 shows the results for runtime performance of madelon data set by using multiple threads for dimensions. We can see that performance improves slightly by processing dimensions in parallel but as discussed before, the mutual exclusive access of the same slot of shared hash table results in the performance degradation. 94 Chapter 5. Parallelization 1000 Runtime (s) 800 100000 slots in hTable 500000 slots in hTable 1000000 slots in hTable 600 400 200 0 12 4 8 16 32 No. of threads 48 Figure 5.5: Dataset: madelon : 4400 × 500, Parameters: = 0.000001andτ = 3. Total dimensions are distributed among threads and these threads run in parallel on separate cores. Each thread computes the density chunks and its signatures independent of other threads. One dimension per thread is processed at a time. Runtime measured with respect to the number of threads and the number of slots in the hT able. 
Multiple cores for slices

The next step is to avoid the memory contention which arises from simultaneous access to the same slot of the shared hash table. This happens when signatures with the same sum value, or signatures with different sum values but the same hash output, are generated simultaneously by different threads running on different dimensions. If the threads could generate signatures requiring different slots at all times, this memory contention could be avoided. We re-implemented the scalable version of the SUBSCALE algorithm using OpenMP threads, but instead of running threads on dimensions, we ran them on slices created using the split factor discussed before. This implementation does not require a lock mechanism for shared access to memory. As discussed earlier, R/sp was used to approximate the numSlots value for each hTable.

Figure 5.6 shows the runtime versus the number of threads used for processing the slices of the madelon dataset. We used the same values of ε and τ as for the shared hTable with the lock mechanism above. The hash computation was sliced with different values of the split factor sp ranging between 200 and 2000. These slices of the hash computation were divided among multiple cores to be run by separate threads in parallel. We can see the performance boost from using more threads. The speedup is significant when there are more slices to be processed. Hence, multiple cores can reduce the runtime significantly when more work needs to be done, that is, for large values of sp. The speedup for the same experiment is shown in Figure 5.7 and becomes linear as the number of slices increases.

Figure 5.6 (runtime in seconds versus the number of threads, for split factors sp between 200 and 2000): Dataset: madelon (4400 × 500); parameters: ε = 0.000001 and τ = 3. The slices are distributed among threads which run in parallel on separate cores. Each thread computes the density chunks and their signatures independently of the other threads; one slice per thread is processed at a time. Runtime is measured with respect to the number of threads and the split factor sp. The overhead of using threads surpasses the performance gain when only 4 slices are being processed by each core.

Figure 5.7 (speedup versus the number of threads, for the same split factors): Speedup for the results given in Figure 5.6. As the number of slices increases, the efficiency gain from multi-core architectures increases. With sp = 200, the number of slices per core can vary from 200 to 4, depending upon the number of threads.

Scalability with the dimensions

Motivated by the results on the madelon dataset, we experimented with the 6144-dimensional pedestrian dataset to study scalability and speedup with a higher number of dimensions. Using the parameters ε = 0.000001 and τ = 3, a total of 19,860,542,724 signatures across all single dimensions are expected from the pedestrian dataset. Each entry in the hash table stores the τ + 1 dense points (say 16 bytes for τ = 3 on a typical computer), the value of a large-digit signature to be matched (8 bytes) and the dimensions that collided (16 bytes for an average 2-dimensional subspace), so the total memory required to store an entry would be approximately 40 bytes. Therefore, the 19,860,542,724 expected signatures would require ∼592 GB of working memory to store the hash tables.
There would be additional memory requirements for the temporary data structures used during the computation process. To overcome this huge memory requirement, we can split the signature computations at two levels. We used a split factor of 60 to bring down the memory requirement for the total hash tables, and each of these 60 slices was further split into 200 sub-slices to be run on multiple cores.

The memory requirements for hTable are different for each slice: we found in our experiments that different numbers of signatures were generated in different slices. As shown in Figure 5.8, the number of signatures appears to follow the familiar bell curve, with a relatively large number of signatures generated in the middle of the splitting range. We investigated the values of the 14-digit keys, which were generated randomly and mapped to the 3661 points of the pedestrian dataset. The keys show no particular bias and their values lie randomly across the full range between 1.0E14 and 1.0E15 (Figure 5.9).

Figure 5.8 (number of signatures in millions versus slice number): Dataset: pedestrian (3661 × 6144); parameters: ε = 0.000001, τ = 3, sp = 60. Number of signatures generated in each of the 60 slices. A large number of signatures are generated towards the middle slice numbers.

Instead of a user-defined number of slots for hTable, as we used for the madelon dataset, we divided the total number of signatures by the split factor to approximate the memory requirement for each slice. The size of hTable was calculated as totalSignatures/sp.

We can see that the execution time decreases drastically with an increase in the number of threads. We used the pedestrian dataset with parameters ε = 1.0E−6, τ = 3, outerSplit = 60 and innerSplit = 200. The total number of expected signatures was 19,860,542,724, and we divided this number by outerSplit × innerSplit to declare an hTable of size 1,655,045. It took around 26 hours to process all 60 slices, with each slice split into 200 sub-slices processed in parallel with 48 threads. The sequential version of the SUBSCALE algorithm reportedly takes ∼720 hours to process this data.

Figure 5.9 (14-digit key values, ×10^13, versus data point ID): The distribution of the values of the 3661 keys used for the pedestrian dataset. No two keys are the same and they are generated from the full space of the 14-digit integer domain, in no particular ascending or descending order. These 3661 keys are mapped one-to-one to the 3661 points.

5.4.4 Summary

The SUBSCALE algorithm introduced in Chapters 3 and 4 can find non-trivial clusters in high-dimensional datasets. However, the time complexity of the SUBSCALE algorithm and its scalable version depends heavily on the computation of 1-dimensional dense units. To further reduce the computational cost, parallelization is the only practical choice. In this chapter, we have used widely available shared-memory multi-core architectures to parallelize the SUBSCALE algorithm. We have developed and implemented various approaches to compute the dense units in parallel. The results with up to 6144 dimensions have shown linear speedup. In the future, we aim to utilise General Purpose Graphics Processing Units to further reduce the execution time of this algorithm.
Chapter 6 Outlier Detection 6.1 Introduction With the evolution of information technology, increasingly detailed data is being captured from a wide range of data sources and mechanisms [3]. While additional details about the data increase the number of dimensions, the consolidation of data from different sources and processes can lead to wider possibilities of introduction of errors and inconsistencies [85]. In addition to the need for better data analysis tools, concerns about data quality have also grown tremendously [86, 87]. The real data is often called ‘dirty data’ or ‘bad data’ as it inevitably contains anomalies like wrong, invalid, missing or outdated information [88]. The anomalies are basically the abnormal values in the data and are also known as outliers. The outliers can arise from an inadequate procedure of data measurement and collection, or an inherent variability in the underlying data domain. The presence of outliers can have disproportionate influence on data analysis [89]. Data analysis is a foundation of any decision making process in a data-driven application domain. Poor decisions propelled by poor data quality can result in significant social and economic costs including threat to national security [90–92]. In 2014, US postal service lost $1.5 billions due to wrong postal addresses [93]. The widespread impact of poor 99 100 Chapter 6. Outlier Detection quality data is also revealed from a recent report which says that 75% of companies waste an average of 14% of revenue on bad data [94]. In some of the critical areas like health sector, poor data quality can lead to wrong conclusions and can have life-threatening consequences [95, 96]. Evidence-Based Medicine (EBM) is the process of using clinical research findings to aid clinical diagnosis and decision making by the clinician [97]. Although EBM is increasingly being used for clinical trials, the quality of patient outcomes depends upon the quality of data [98]. In addition to EBM, the quality of health care data also plays an important role in scheduling and planning hospital services [99]. Nonetheless, the quality of data depends upon the context in which it is produced or used [100]. The broader meaning of data quality has evolved from the term ‘fitness for use’ proposed in a quality control handbook by Juran [101]. Although, efforts have been made to define data quality in terms of various characteristics like accuracy, relevance, timeliness, completeness, and consistency [102, 103], there is no single tool which can solve all of the the data quality problems. In fact, the problem of ‘data quality’ is multifaceted and usually requires domain knowledge and multiple quality improvement steps [104–106]. Data cleaning, also known as scrubbing or reconciliation or cleansing, is an inherent part of data preprocessing used by data warehouses in order to improve data quality [5]. Maletic and Marcus [107] enumerated the steps for data cleansing process which includes identifying the anomalous data points and applying appropriate corrections or purges to reduce such outliers. The domain experts usually intervene in the cleaning process because their knowledge is valuable in identification and elimination of outliers [108]. Additionally, a significant portion of data cleaning work has to be done manually or by low-level programs that are difficult to write and maintain [85, 86]. Needless to say, data cleaning is a time consuming and expensive process. 
According to Dasu and Johnson [85], 80% of the total time spent by a data analyst goes into data cleaning alone. The increase in high-dimensional data these days poses further challenges for data cleaning. The main reason is that outliers in high-dimensional data are not as obvious as in univariate or even low-dimensional data. Normal and abnormal data points exhibit shared behaviour across multiple dimensions. The problem is further exacerbated by the surprising behaviour of distance metrics in higher dimensions, known as the curse of dimensionality (also discussed in Chapter 1) [8, 11]. The state-of-the-art traditional methods do not work for outlier detection in high-dimensional data [37, 109, 110].

In this chapter, we focus on the data cleaning aspect through efficient identification of outliers in high-dimensional data. We also endeavour to characterise each outlier with a measure of outlierness, which can help the analyst make an informed decision about the outlier. In the next section, we discuss the issues pertaining to outliers while cleaning high-dimensional data. We discuss related outlier detection methods in Section 6.3. In Section 6.4, we propose our approach to deal with high-dimensional data for outlier detection.

6.2 Outliers and data cleaning

The detection and correction of anomalous data is the most challenging problem within data cleaning. According to Hawkins [111], an outlier or an anomaly is an observation that deviates so much from the rest of the observations as to arouse suspicion about its origin. Quite often, outliers skew the data or bring another dimension of complexity into data models, making it difficult to analyse the data accurately. Outliers may be of interest for several other reasons too: apart from data cleaning, outlier detection has numerous applications in fraud detection, detection of criminal activities, gene expression analysis and environmental surveillance.

There are different ways to handle univariate outliers. Statistical methods based on Chebyshev's theorem [112] are very common for data cleaning: the points beyond a certain number of standard deviations are termed outliers using a confidence interval. However, univariate or even low-dimensional outliers are usually obvious and can be detected through visual inspection or using traditional approaches.

Table 6.1: Outlier removal dilemma

Data points    d1     d2     d3
P1             11     50     60
P2             10     52     63
P3             12     250    62
P4             101    49     03

But for high-dimensional data (also called multivariate data), the outliers are hidden in the underlying subspaces. In Table 6.1, the data point P3 seems to have an abnormal value in dimension d2, which might represent the age of a person. However, point P3 appears normal under the subspace {d1, d3}. Similarly, the data point P4 has abnormal values in dimensions d1 and d3 but appears to be normal in dimension d2. The outlierness of points P3 and P4 is still observable in this 3-dimensional dataset, but the detection of such outliers becomes very challenging for high-dimensional data as the number of possible subspaces grows exponentially with the number of dimensions. Moreover, analysts are frequently faced with the dilemma of what to do with an outlier. In many cases, the available information and knowledge is insufficient to determine the correct modification to be applied to the outlier data points.
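Returning to Table 6.1, the following C sketch makes the dimension-by-dimension view concrete. It uses a deliberately simplified 1-dimensional test — a point is flagged in a dimension if no other point lies within ε of it there — with an arbitrarily chosen ε = 5; this is only an illustration, not the density criterion (ε together with τ) used by SUBSCALE later in this thesis.

/* Hedged sketch: flag 1-dimensional outliers in the Table 6.1 data using a
 * simplified epsilon-neighbour test (epsilon = 5 is an arbitrary choice,
 * not a parameter taken from the thesis experiments).                     */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const char  *names[4]  = { "P1", "P2", "P3", "P4" };
    const double data[4][3] = {              /* rows: points, cols: d1..d3 */
        { 11, 50, 60 }, { 10, 52, 63 }, { 12, 250, 62 }, { 101, 49, 3 }
    };
    const double eps = 5.0;

    for (int j = 0; j < 3; j++) {             /* each dimension separately */
        for (int i = 0; i < 4; i++) {
            int has_neighbour = 0;
            for (int other = 0; other < 4; other++)
                if (other != i && fabs(data[i][j] - data[other][j]) <= eps)
                    has_neighbour = 1;
            if (!has_neighbour)
                printf("%s is a 1-d outlier in dimension d%d\n", names[i], j + 1);
        }
    }
    return 0;
}

With these hypothetical settings the sketch flags P3 only in d2 and P4 only in d1 and d3, matching the discussion above.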
On the one hand, removal of outliers may greatly enhance the data quality for further analysis and can be a cheaper, more practical solution than fixing them. On the other hand, deleting a data point Pi detected as an outlier can lead to a loss of information if Pi is not an outlier in all of the dimensions. This loss of information can be avoided by obtaining additional details about the point, for example, the number of subspaces in which it shows outlying behaviour. Ranking the data points in the order of their outlierness also helps to focus on the important outliers and deal with them accordingly.

Both clustering and outlier detection are based on the notion of similarity among the data points: the clusters are the points lying in the dense regions, while the outliers are the points lying in the sparse regions of the data. As with clustering, state-of-the-art traditional distance or density methods do not work for outlier detection in high-dimensional data [37, 109, 110]. These methods look for outliers using all of the dimensions simultaneously, but due to the curse of dimensionality, all data points appear to be equidistant from each other in the high-dimensional space. The notion of proximity fails in the sparse high-dimensional space and every point appears to be an outlier. Outliers are complex in high-dimensional data because the points are correlated differently under different subsets of dimensions. Referring to Figure 1.1 from Chapter 1, a data point can be part of a cluster in some of the subspaces while existing as an outlier in the rest of the subspaces. Due to the exponential growth in the number of subspaces with the dimensions of the data, finding outliers in all subspaces is a computationally challenging problem. There is an exigent need for efficient and scalable outlier detection algorithms for high-dimensional data [37, 113].

In this chapter we focus on the utility of outlier detection in data cleaning applications. In addition to detecting outliers, it is important and useful to further characterise each outlier with a measure of its outlierness in the form of an outlier score. The outlier score can reveal the interestingness of an outlier to the data analyst. Most outlier detection algorithms work as a labelling mechanism, giving a binary decision on whether a data point is an outlier or not [114]. Scoring and ranking the outliers can give a better understanding of the behaviour of outliers with respect to the rest of the data and can aid the data cleaning process. In the previous chapters, we proposed algorithms to find clusters embedded in the subspaces of high-dimensional datasets. In this chapter, we utilise these algorithms to discover outliers embedded in the subspaces of the data. We also propose further characterisation of these outliers through their outlying score. Before discussing our approach, we survey the current state of outlier detection research in the next section.

6.3 Current methods for outlier detection

There has been significant research work in the outlier detection area, as detailed in recent literature surveys [37, 110, 114]. Historically, the problem of outlier detection has been studied extensively in statistics, notably by Barnett and Lewis [112], by categorising data points with low probability under the assumed distribution as outliers.
However, this approach requires a prior knowledge of the underlying distribution of the data set, which is usually unknown for most large data sets. In order to overcome the limitations of the statistical-based approaches, distance and density based approaches were introduced [115, 116]. Still most of the work in outlier detection deals with low-dimensional data only. 6.3.1 Full-dimensional based approaches Knorr [115] suggested a distance-based approach such that the objects with less than k neighbors within distance λ were outliers. Its variant was proposed by Ramaswamy et al. [117] which takes the distance of an object to its k th nearest neighbor as its outlier score and retrieve the top m objects having the highest outlier scores as the top m outliers. In the same year, Breuing et al. [118] proposed to rank outliers using local outlier factor (LOF) which compares the density of each object of a data set with the density of its k-nearest neighbors. A LOF value of approximately 1 indicates that the corresponding object is located within a cluster of homogeneous density. The higher the difference of the density around an object compared to the density around its k-nearest neighbors, the higher is the LOF value that is assigned to this object. Later on, improvements over these outlier ranking schemes were proposed [119–121] but again they are based on the full dimensional space and face the same data sparsity problem in higher dimensions. Most proposed approaches so far which are explicitly or implicitly based on the assessment of differences in Euclidean distance metric between objects in full-dimensional space, do not work efficiently [122]. Some re- 105 Chapter 6. Outlier Detection searchers [116, 123] have used depth based approaches from computer graphics where objects are organized in convex hull layers expecting outliers with shallow depth values. But these approaches too fail in the high-dimensional data due to the inherent exponential complexity of computing convex hulls. Kriegal et al. [122] have used variance of angles between pairs of data points to rank outliers in high-dimensional data. If the spectrum of observed angles for a point is broad, the point will be surrounded by other points in all possible directions meaning the point is positioned inside a cluster and a small spectrum means other points will be positioned only in certain directions, indicating that the point is positioned outside of some sets of points that are grouped together. However, the method cannot detect outliers surrounded by other points in subspaces and the naive implementation of the algorithm runs in O(n3 ) time for a data set of n points. These traditional outlier ranking techniques using outlierness measures in full space are not appropriate for outliers hidden in subspaces. In the full space all objects appear to be alike so that traditional outlier ranking cannot distinguish the outlierness of objects any more. An object may show high deviation compared to its neighbourhood in one subspace but may cluster together with other objects in a second subspace or might not show up as an outlier in a third scattered subspace [124]. 6.3.2 Subspace based approaches The problem of outlier detection in subspaces has been mostly neglected by the research community so far. Although it is important to look into the subspaces for interesting and potentially useful outliers, the number of possible subspaces increases exponentially with increase in the number of dimensions. 
However, some authors [125, 126] have contended that not all attributes/dimensions are relevant for detecting outlying observations.

Pruning the subspaces

The complexity of the exhaustive search over all subspaces is 2^k, where k is the data dimensionality. Our aim is to detect outliers in high-dimensional data by choosing the relevant subspaces and then pruning the objects so as to minimise the calculations for every object in the selected subspaces. One approach for dealing with high-dimensional data is dimensionality reduction techniques such as PCA (Principal Component Analysis), which map the original data space to a lower-dimensional data space. However, these methods may be inadequate for getting rid of irrelevant attributes, because different objects show different kinds of abnormal patterns with respect to different dimensions. To reduce the search space, we rely on the downward closure property of density, which enables an Apriori-like search strategy. The subspaces can be pruned with respect to outliers based upon the following properties:

Property a. If an object is not an outlier in a k-dimensional subspace S, then it cannot be an outlier in any subspace that is a subset of S.

Property b. If an object is an outlier in a k-dimensional subspace S, then it will be an outlier in any subspace that is a superset of S.

Knorr and Ng [127] have proposed algorithms to identify outliers in subspaces instead of the full-attribute space of a given dataset. Their main objective was to provide some intentional knowledge of the outliers, that is, a description or an explanation of why an identified outlier is exceptional. For example, what is the smallest set of attributes that explains why an outlier is exceptional? Is this outlier dominated by other outliers? Aggarwal et al. [128] then proposed a grid-based subspace outlier detection approach; they used the sparsity coefficient of subspaces to detect outliers and used evolutionary computation as the subspace search strategy. Recent approaches have enhanced subspace outlier mining by using specialised heuristics for subspace selection and projection [109, 129, 130]. Muller and Schiffer [124] have approached the problem of subspace-based outlier detection and ranking by first pruning the subspaces which are uniformly distributed and then ranking each object in the remaining subspaces using kernel density estimation. Although this recent work [124] is a step towards subspace-based outlier detection and ranking, it has its own limitations: for example, it rejects a few subspaces completely but then calculates the density for each and every object in the remaining subspaces. Our aim is to efficiently prune subspaces as well as objects in the remaining subspaces. However, most existing approaches suffer from the difficulty of choosing meaningful subspaces as well as from exponential time complexity in the data dimensionality.

6.4 Our approach

Our aim is to find outliers embedded in all possible subspaces of high-dimensional data and then to efficiently characterise them by measuring their outlierness. Technically, exploring the exponential number of subspaces of high-dimensional data to detect the relevant outliers is a non-trivial problem. The exhaustive search of the multi-dimensional space-lattice is computationally very demanding and becomes infeasible when the dimensionality of the data is high.
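To make the size of this search space concrete, a back-of-the-envelope count over the dimensionalities used later in this chapter is enough:

\[
  N(k) = 2^{k} - 1, \qquad
  N(22) = 4{,}194{,}303, \qquad
  N(30) \approx 1.07 \times 10^{9}, \qquad
  N(500) \approx 3.3 \times 10^{150},
\]

where N(k) counts the non-empty axis-parallel subspaces of a k-dimensional dataset, and 22, 30 and 500 are the dimensionalities of the Parkinsons, Breast Cancer and madelon datasets used in the experiments below. Any method that enumerates subspaces explicitly is therefore hopeless beyond a few tens of dimensions, which is why the pruning properties above and the subspace clustering machinery of the earlier chapters are needed.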
In the previous chapters, we tackled the problem of subspace clustering, and our proposed SUBSCALE algorithm is efficient and scalable with the dimensions. It is therefore desirable to utilise our already established technique to solve the problem of outlier detection and ranking in high-dimensional data. Our work is motivated by the following observations:

1. In a high-dimensional space, due to the curse of dimensionality, every data point is far away from every other, and thus it is difficult to find outliers using the full-dimensional space. However, data points show interesting correlations with each other in the underlying subspaces.

2. For k-dimensional data, there are 2^k − 1 subspaces to be searched for each data point, which is a computationally expensive task. So, efficient pruning of subspaces as well as of the data points is needed. Most of the literature on subspace pruning is based on heuristic measures, but this random selection of subspaces is bound to generate random identification and ranking of outliers, giving poor results. We aim to develop efficient and meaningful measures rather than heuristic selections of outliers.

3. We have no prior information about the underlying data distribution or the significant dimensions for detecting outliers. So, we focus on solving this problem using unsupervised density-based methods, especially subspace clustering. Clustering, also known as unsupervised learning, distinguishes dense areas with high data concentration from sparse areas. As outliers have low density around them, we can explore these sparse areas.

4. Measuring the outlierness of an object is more important than just labelling it as an outlier or inlier. We aim to provide a better ranking of the outliers based on their behaviour in different subspaces. Keeping in view the utilisation of outlier detection for data cleaning, we endeavour to aid the process of improving data quality through the outlier score of each data point.

5. We need to adapt our outlier detection technique according to the dimensionality of the subspaces. As the dimensionality increases, the density of nearest neighbours decreases, so our algorithm should be able to adjust its parameters accordingly.

6.4.1 Anti-monotonicity of the data proximity

According to the downward closure of dense regions proposed by Agrawal et al. [15] in their Apriori search strategy, a data point from a dense cluster in a subspace S will be a part of a dense region in all lower-dimensional projections of S. If we consider the anti-monotonicity property of the same principle, we can infer that an object which is an outlier in a subspace S will be an outlier in all higher-dimensional subspaces which are supersets of the subspace S.

Figure 6.1: Outlier in trivial subspaces. (The figure shows the lattice of subspaces of the 5-dimensional space {1, 2, 3, 4, 5}; the subspaces in which Pi is an outlier are shaded and those in which Pi is not an outlier are unshaded.)

Consider dist_S(P1, P2) as a proximity distance function between two points P1 and P2 in a subspace S (the similarity measures were discussed in Chapter 1). If S1 and S2 are two different subspaces and subspace S2 is a superset of subspace S1, that is,
S2 contains all the dimensions from subspace S1, then the following property holds for the downward closure of the search space for outliers:

\[
  dist_{S_2}(P_1, P_2) \;\geq\; dist_{S_1}(P_1, P_2) \iff S_2 \supset S_1 \tag{6.1}
\]

Thus, the distance between two points will not become any shorter as we move from a lower-dimensional subspace S1 to a higher-dimensional subspace S2, where S2 contains all dimensions of S1 and some more. This is the reason anti-monotonicity holds for outliers in multi-dimensional space. In the example given in Figure 6.1, the data point Pi first appears as an outlier in the subspace {1, 3} and is an outlier in all supersets of {1, 3}, as shown by the shaded subspaces. The shaded subspaces contain redundant information about the outlier and are trivial.

6.4.2 Minimal subspace of an outlier

We discussed in the previous subsection that if a point Pi is an outlier in a subspace S then it is also an outlier in all higher-dimensional subspaces S' where S' ⊃ S. Each such subspace S' is known as a trivial subspace, while S is a non-trivial subspace. If |S| denotes the number of dimensions in a subspace S, then we define a minimal subspace with respect to an outlier point Pi as follows:

Definition 7 (Minimal subspace of an outlier). A subspace S_min^i is a minimal subspace with respect to an outlier point Pi if there exists no other subspace S' in which Pi appears as an outlier and whose dimensions are a proper subset of the dimensions in S_min^i, with 1 ≤ |S'| < |S_min^i|.

Alternatively, S is a minimal subspace for a data point Pi if, in every lower-dimensional subset of S, Pi appears as a part of some dense region. A data point Pi can appear as an outlier in many subspaces which fulfil the condition of minimal subspaces for this point. Let us denote the set of minimal subspaces for a data point Pi as S_min^i. The subspaces in S_min^i can be either partially overlapping or non-overlapping with each other, but none of these subspaces is a complete subset of another. In Figure 6.1, the point Pi appears as an outlier for the first time in the subspaces S1 = {1, 3}, S2 = {1, 5} and S3 = {2, 3, 4}. All three of these subspaces are minimal for the outlier point Pi. We notice that S1 ∪ S2 ∪ S3 equals the full-dimensional space, so no further minimal subspace can exist for the data point Pi.

Detecting outliers in the minimal subspaces

Our interest in finding the minimal subspaces of each outlier is based on the intuitive idea that the cardinality of S_min gives an indication of the outlying behaviour of a data point.

Observation 4. If m is the number of dimensions in a minimal subspace S_min, such that m = |S_min| and m ≤ k, then a smaller value of m means that Pi shows outlying behaviour in a larger number of subspaces. Typically, Pi will show outlying behaviour in all of the 2^(k−m) − 1 higher-dimensional subspaces.

The SUBSCALE algorithm detects the maximal subspace for each dense unit of points. If S_max is a maximal subspace for a set of dense units, it means that the points in these dense units will not appear together in the next higher subspaces which are supersets of S_max. However, the points in these dense units can appear as outliers, or participate in other dense units with other points, in higher-dimensional subspaces. The behaviour of the dense points from subspace S_max in its superset subspaces will depend entirely upon the underlying density distribution of these points in those subspaces.
For example, we assume that two dense units Ua = {P1 , P2 , P3 , P4 } and Ub = {P1 , P5 , P6 , P7 } exist in dimensions d1 and d2 . Along with these two, suppose the dense unit Ub also exists in dimension d3 . Using SUBSCALE algorithm, the dense unit Ua will be detected in the maximal subspace {d1 , d2 } while the dense unit Ub will be detected in the maximal subspace {d1 , d2 , d3 }. Here, point P1 is part of another dense unit in higher dimensional subspace {d1 , d2 , d3 }. Starting with the 1-dimensional dense units, the points which do not participate in a dense unit in a particular dimension are the 1-dimensional outliers. These outlier points are easy to detect from 1-dimensional dense units. For example, if there are 7 points in total out of which 2 points fail to participate in any of the 1-dimensional dense units in a dimension dj , then dj is the minimal subspace for these 2 outlier points. Some of the dense points from 1-dimensional dense units might not participate in any of the 2-dimensional dense units. These will be outliers in the 2-dimensional space of the data. These 2-dimensional outliers were not detected in single dimensions and appear for the first time in the 2-dimensional subspaces. But, it is hard to detect these outliers from the 2-dimensional maximal subspaces of the clusters given by the SUBSCALE algorithm. As discussed in the previous example, from maximal subspace {d1 , d2 }, we only know about dense unit Ua = {P1 , P2 , P3 , P4 } and cannot deduce that the remaining points (DB − {P1 , P2 , P3 , P4 }) will be outliers in the subspace {d1 , d2 }. The reason is that some 112 Chapter 6. Outlier Detection of the remaining points might participate in other dense units, which also exist in the additional single-dimensions and thus, would show up as the higher-dimensional dense units, for example Ub in this case. Thus, given a maximal subspace cluster, it is difficult to find outliers directly. One solution is to take the projections of all maximal subspaces in their lower-dimensional subspaces (if they exist). In the previous example, since {d1 , d2 } is a projection of subspace {d1 , d2 , d3 }, we can concatenate the dense points from dense units Ua and Ub . Thus DB − (P1 , P2 , P3 , P4 , P5 , P6 , P7 ) are the outliers in the subspace {d1 , d2 }. But we cannot say that {d1 , d2 } is a minimal subspace for these outliers. Because some of the outliers would have made their first appearance in the lower-dimensional subspaces (single dimensions in this case). To reiterate, a subspace is a minimal subspace of an outlier, if a data point has appeared for the first time in this subspace as an outlier and it is not an outlier in all lower-dimensional projections of this subspace. In each of the detected maximal subspaces S, the set of points which are not part of the cluster in this subspace are the outliers. Let us denote this set of outlier points as O. Some of these outliers will be old outliers O0 showing up from lower-dimensional subspaces such that each such lower-dimensional subspace S 0 is a subset of S. Thus, the subspace S will be a minimal subspace for the points in O − O0 . Ranking outliers using minimal subspaces The ranking decision of an outlier cannot be taken until all outliers have been discovered in all possible minimal subspaces. A score needs to be assigned to each outlier in each of the relevant minimal subspace. 
The scores can be accumulated for each outlier to find its total score, which decides the rank of this outlier with respect to the other data points.

The number of subspaces in which an object shows outlying behaviour contributes to the strength of its outlyingness: the more subspaces in which an outlier appears, the more strongly it should be weighted. Thus, following Observation 4, an outlier which was first detected in a lower-dimensional subspace should have a bigger score than an outlier which was first detected in a higher-dimensional subspace. Another argument is that, due to the curse of dimensionality, the probability of data existing in clusters is higher in the lower-dimensional subspaces. Therefore, if a data point is not able to group together with other data points in these high-probability subspaces, then it should be given a higher score as an outlier.

Based on Observation 4, we can use the number of dimensions in the minimal subspace of an outlier as a measure of the score. Let us assume that, in addition to the minimal subspace S1 = {1, 3} (as shown in Figure 6.1), the point Pi also exists as an outlier in a minimal subspace S2 = {1, 5}. Since Pi is an outlier in S1, it will be an outlier in 2^(5−2) − 1 = 7 higher-dimensional superset subspaces. Similarly, with reference to subspace S2, Pi is again expected to be an outlier in 7 subspaces which are supersets of S2. We can assign a score of 7 + 7 = 14 to the point Pi for both subspaces S1 and S2. But there is a problem with this approach: there will be some common subspaces which contain all dimensions of S1 ∪ S2, for example {1, 3, 5} and {1, 3, 4, 5}. In this particular example, there will be 2^(5−3) = 4 such redundant subspaces, so the correct outlier score for the point Pi will be 14 − 4 = 10. It is possible that an outlier which was discovered in many higher-dimensional subspaces might end up with a total score higher than that of an outlier which was discovered in only a few lower-dimensional subspaces. The total score can only be found by applying corrections for the common subspaces between peer minimal subspaces (defined below), for example S1 and S2 in the above case. For example, in Figure 6.1, S1 = {1, 2, 3} and S2 = {1, 2, 4} are peer subspaces, as both are 3-dimensional subspaces with different sets of attributes, but there will be some common subspaces which contain all dimensions of S1 and S2.

Definition 8 (Peer subspaces). We define peer subspaces as subspaces which have the same number of dimensions but differ from each other in at least one dimension.

To calculate the total outlier score of each point, all of its peer minimal subspaces must first be found and then the correction for the common subspaces applied, as illustrated below. This process involves matching every subspace from S_min^i with the other subspaces. We notice that the process of ranking outliers using minimal subspace theory is a two-step process. The first step is to find the minimal subspaces of each point by processing the old and new outliers, as discussed in Section 6.4.2. The second step is to process the set S_min^i for each point Pi and find the total score of each outlier. Each data point will have its own set of minimal subspaces, and the number and sizes of these minimal subspaces will be different for each point.
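The correction described above can be written compactly. For the k = 5 example with minimal subspaces S1 = {1, 3} and S2 = {1, 5}, inclusion–exclusion over the supersets of the minimal subspaces gives

\[
  \mathrm{score}(P_i)
  = \bigl(2^{\,k-|S_1|}-1\bigr) + \bigl(2^{\,k-|S_2|}-1\bigr) - 2^{\,k-|S_1 \cup S_2|}
  = (2^{3}-1) + (2^{3}-1) - 2^{2} = 10 .
\]

With more than two peer minimal subspaces, the same correction alternates signs over all of their unions, which is precisely the subspace-matching cost discussed next.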
Thus, this approach carries the computational expense of, firstly, keeping track of old outliers while calculating the minimal subspaces and, secondly, matching each and every subspace in the minimal subspace set of each point. An alternative approach is to score the data points using the inliers, that is, the points which are not outliers. In the next section, we introduce the concept of the maximal subspace shadow of an outlier.

6.4.3 Maximal subspace shadow

A data point can appear as an outlier in some of the subspaces while it appears as an inlier in other subspaces. An inlier means that the data point is part of some dense unit (or cluster) in those other subspaces. We define the maximal subspace shadow of a data point Pi as the maximal subspace S up to which it could survive without showing up as an outlier; Pi ceases to be part of a cluster in any subspace S' which is a superset of S. In a subspace hierarchy, the higher a data point can rise without being an outlier, the weaker it becomes as an outlier. The maximal subspace shadow S of an outlier point Pi is like a shadow of the outlier, which was dense up to subspace S and no longer exists in any superset of subspace S. It is important to note that, like minimal subspaces, a data point can have many maximal subspace shadows existing in different subspaces. Also, none of these shadow subspaces related to the same point will be a superset or subset of another.

The maximal subspace shadow is easier to calculate than the minimal subspace for each outlier. As we already have the SUBSCALE algorithm, which directly finds the maximal subspace clusters, some of the data points in these clusters will never appear as part of some other cluster in the superset maximal subspaces. For each maximal subspace S found by the SUBSCALE algorithm, we can iterate through each of the supersets S' of the subspace S. Due to the Apriori principle, all dense points in the maximal subspace S' are also dense in the lower-dimensional subset subspace S. Once we remove those points from S which also exist in S', we have the set of points whose maximal subspace shadow is S.

Once we have calculated the maximal subspace shadows of each data point, we assign scores to it. Algorithm 16 shows the steps to calculate the rank of all points using the SUBSCALE algorithm. We use the size of each of the detected maximal subspace shadows to assign an outlier score to a point. A higher score is assigned to a point whose maximal subspace shadow lies in a lower-dimensional subspace than to a point whose maximal subspace shadow lies in a higher-dimensional subspace. The number of dimensions in the maximal subspace shadow counts towards the scoring. For example, if a point P1 has a maximal subspace shadow S = {d1, d2}, then its score is increased by k − 2, where 2 is the dimensionality of S. A k-dimensional dataset can have subspaces of dimensionality 1, 2, 3, 4, ..., k, and the number of subspaces of a given dimensionality varies across this range ($\binom{k}{2}, \binom{k}{3}, \ldots$ towards the start and $\binom{k}{k-2}, \binom{k}{k-3}, \ldots$ towards the end). We therefore normalise the score calculation for each maximal subspace shadow of dimensionality r by dividing the score by $\binom{k}{r}$, as in the rank-update step of Algorithm 16.

Also, if a data point exists as an outlier in a 1-dimensional subspace, that is, it is not present in any of the core-sets created using the ε-neighbourhood, then it should be strongly scored.
We penalise such points by adding 1 to the rank for each single dimension in which they appear as an outlier, although this penalty can be set to any other high value as well. These points are said to have a maximal subspace shadow of size zero.

Input: n: total number of data points; k: total number of dimensions; Clusters: set of clusters in their maximal subspaces.
Output: Rank: rank of each of the n points. The higher the score, the stronger a point is as an outlier.
     /* Initialise ranks for all points to 0. */
1    for i ← 1 to n do
2        Rank[Pi] ← 0
3    end
4    for j ← 1 to k do
5        Find core-sets in dimension j using Algorithm 8 with density threshold parameters ε and τ
6        for each point Pi not participating in any of the core-sets in dimension j do
7            Rank[Pi] ← Rank[Pi] + 1
8        end
9    end
     /* Each entry in Clusters is <P, S>, where P is a set of points which are dense in a subspace S. Clusters are found by the SUBSCALE algorithm discussed in the previous chapters. */
10   for each entry <P, S> in Clusters (including all 1-dimensional clusters) do
11       X ← ∅
12       for each entry <P', S'> in Clusters where S' ⊃ S do
13           Append P' to X
14       end
15       P ← P − (P ∩ X)
16       for Pi ∈ P do
17           Rank[Pi] ← Rank[Pi] + (k − |S|) / C(k, |S|)    /* |S| is the number of dimensions in the subspace S, and C(k, |S|) is the binomial coefficient "k choose |S|" */
18       end
19   end
Algorithm 16: Rank outliers: rank the outliers based on the SUBSCALE algorithm

6.5 Experiments

We experimented with four different datasets: shape (160 × 17), Breast Cancer Wisconsin (Diagnostic) (569 × 30), madelon (4400 × 500), and Parkinsons disease (195 × 22). The shape dataset is taken from the OpenSubspace Project page [67] and the rest of the datasets are freely available at the UCI repository [69, 131]. We used the SUBSCALE algorithm to find clusters in all possible subspaces of each dataset. Then, we calculated the maximal subspace shadows of the data points using Algorithm 16. We evaluated the outlier scores of each data point with different ε-values. When we increase the ε-value, more points are packed into the clusters due to the increased neighbourhood radius, so we expect the overall outlier scores to drop for a bigger value of the ε parameter. This is evident from the graphs shown below.

Figure 6.2 shows the outlier scores for the small 17-dimensional shape dataset with three ε-values: 0.01, 0.02, 0.03. The overall outlier score ranges between 0 and 17. A data point with an outlier score of 0 means that it did not appear as an outlier in any of the subspaces and is part of some cluster in the higher k-dimensional subspace. It also implies that there will be at least τ more points with a score of 0.

Figure 6.3 shows the outlier ranking for the 22-dimensional Parkinsons disease dataset of 195 points. Each data point is classified in the original dataset as either diseased or healthy. Out of the 195 points, 147 are diseased. We assume that the top 147 outliers should convey the information about the diseased data points. We calculated the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) for our results under three different settings. Table 6.2 displays these results, including the precision TP/(TP + FP), recall TP/(TP + FN) and fall-out FP/(FP + TN) rates. The outliers were predicted with more than 82% precision and recall.

Similar to the above dataset, we also experimented with the Breast Cancer (Diagnostic) dataset, which is bigger than the Parkinsons disease dataset, with 30 dimensions and 569 data points.
There are 212 malignant and 356 benign data points, as given in the original data description. Thus, we analysed the top 212 outliers for being malignant, using three different ε parameters.

Figure 6.2 (outlier score versus data points): Outlier scores for the shape dataset (160 data points in 17 dimensions). The scores are evaluated with three different ε-values: 0.01, 0.02, 0.03.

Figure 6.3 (outlier score versus data points): Outlier scores for the Parkinsons disease dataset (195 data points in 22 dimensions). The scores are evaluated with three different ε-values: 0.001, 0.005, 0.01.

Table 6.2: Evaluation of the Parkinsons disease dataset

ε-value   TP    FP   TN    FN   Precision (%)   Recall (%)   Fall-out (%)
0.001     116   31   17    31   78.9            78.9         64.6
0.005     121   26   22    26   82.3            82.3         54.2
0.03      106   41   7     41   72.1            72.1         85.4

Table 6.3: Evaluation of the Breast Cancer dataset

ε-value   TP    FP   TN    FN   Precision (%)   Recall (%)   Fall-out (%)
0.001     160   52   305   52   75.5            75.5         14.6
0.005     173   39   318   39   81.6            81.6         10.9
0.01      136   76   281   76   64.2            64.2         21.3

The results are plotted in Figure 6.4; however, the fluctuations between the outlier scores seem to be larger for ε = 0.005. The outliers detected from the Breast Cancer (Diagnostic) dataset were also evaluated for the performance of our algorithm and the results are given in Table 6.3. The outliers were predicted with more than 81% precision and recall. As with the other datasets, we can see a reduction in the overall outlier scores with bigger epsilon (ε): a large ε-value results in more points being packed into the clusters, leaving very few points out as outliers. Since the scores are calculated for those data points which do not participate in the clusters, the score values reduce with larger ε. We also analysed the performance of our outlier ranking through precision, recall and fall-out, as done for the above dataset. The results seem better with ε = 0.005. We used heuristics to decide on the epsilon value: a preliminary test was done on the data to choose a minimum starting epsilon which can generate clusters in at least one or two different subspaces.

Finally, we experimented with the 500-dimensional madelon dataset of 4400 points with ε-values of 0.000001, 0.000005 and 0.00001. Similar trends between the ε-value and the overall ranking can be seen in this dataset as well.

Figure 6.4 (outlier score versus data points): Outlier scores for the Breast Cancer (Diagnostic) dataset (569 data points in 30 dimensions). The scores are evaluated with three different ε-values: 0.001, 0.005, 0.01.

Figure 6.5 (outlier score versus data points): Outlier scores for the madelon dataset (4400 data points in 500 dimensions). The scores are evaluated with three different ε-values: 0.000001, 0.000005, 0.00001.

All of our ranking computations for these four different datasets took between a few milliseconds and 5 minutes.

6.6 Summary

Poor quality data hampers the efficacy of data analysis and the decision making that follows it. Data is collected through automatic or manual processes; these processes can introduce errors, or the data itself may contain anomalies. Outliers are the data points that show anomalous behaviour compared with the rest of the data.
While it ensures the quality of data, data cleaning is a laborious and expensive process. Considering the important role played by data quality in the credibility of decision making, we cannot escape data cleaning as a pre-processing step of data analysis.

In this chapter, we have presented an outlier detection and ranking algorithm for high-dimensional data. Our approach is highly scalable with the dimensions of the data and efficiently deals with the curse of dimensionality. Our algorithm also gives further insight into the behaviour of each outlier by providing additional details about its relevant subspaces and the degree of outlierness it exhibits. This outlier characterisation is important because it can help users to evaluate the identified outliers and understand the data better.

Chapter 7

Conclusion and future research directions

In this thesis, we have worked on the challenging problems of subspace clustering and of outlier detection and ranking in high-dimensional data. There has been a plethora of research work on clustering in the last few years, but due to the exponential growth of the search space with increasing dimensions, analysing big datasets with high dimensions remains a computationally expensive task. As a discussion of all of this work is out of scope for this thesis, we have highlighted some of the current and related work in Chapter 2.

We have proposed a novel algorithm called SUBSCALE, which is based on number theory and finds all possible subspace clusters in the subspaces of high-dimensional data without using expensive indexing structures or performing multiple data scans. The SUBSCALE algorithm directly computes the maximal subspace clusters and is scalable with both the size and the dimensions of the data. The algorithm finds groups of similar data points in all single dimensions based on ε-distance within the 1-dimensional projections of the data points. These one-dimensional similarity groups are broken into fixed-size chunks called dense units. The points in each dense unit are mapped to unique keys, and the sum of the keys of the points in a dense unit is called its signature. The collision of such signatures from all single dimensions results in the discovery of hidden clusters in the relevant multi-dimensional subspaces. Chapter 3 introduced the basic SUBSCALE algorithm, while its scalable version was presented in Chapter 4. We have experimented with numerical datasets of up to 6144 dimensions and our proposed algorithm has been very efficient as well as effective in finding all possible hidden subspace clusters, whereas the other state-of-the-art clustering algorithms have failed to perform for data of such high dimensionality. The work in Chapter 4 demonstrates that a combination of algorithmic enhancements to the SUBSCALE algorithm and distribution of the computations over a network of workstations can allow a large dataset to be clustered in just a few minutes.

In Chapter 5, we have also presented the parallel version of the SUBSCALE algorithm to reduce the time complexity for bigger datasets. The linear speedup with up to 48 cores looks very promising. However, the shared-memory architecture of OpenMP is still a bottleneck due to the lock mechanism, as discussed in Chapter 5. A Message Passing Interface (MPI) based parallel model could be explored for processing each dimension or slice locally, with intermittent communication among the nodes.
Additionally, the computing power of General Purpose Graphics Processing Units (GPGPUs) could be harnessed by implementing the SUBSCALE algorithm with OpenCL or CUDA, but the algorithm would need to be adapted to minimise the communication overhead of accessing the common hash table for collisions. The hash table could be managed centrally, or replicated and synchronised periodically among the nodes. Efficient parallel clustering techniques will be very much needed for cluster analysis in large-scale data mining applications in the future.

Also, it would be interesting to look into the details of the bell curve we came across in Chapter 5 (Figure 5.8). The pedestrian data of 3661 points in 6144 dimensions was sliced using sp = 60, and that graph plotted the total number of signatures generated across all single dimensions within the LOW and HIGH range of each of the slices. The large-integer keys assigned to these 3661 data points are completely random, as plotted in Figure 5.9, so the values of the signatures generated from the combinations of dense points in each of the single dimensions are also expected to be random. But we notice that the number of signature values lying between the LOW and HIGH range near the middle slice number (30 in this case) is the highest. In addition to exploring the reasons for this high turnout of signatures in the centre, the SUBSCALE algorithm could be optimised further by splitting the computations more finely near sp/2 rather than in the ranges [0, sp/C) or [sp − sp/C, sp), where C is a constant which measures the number of cheaper computations towards the beginning or end of the range of slices.

Although we have worked with numerical data with no missing values, our work can be further extended to deal with data with missing values. Using the closeness of data points in other subspaces, approximations can be made for missing values in correlated subspaces or dimensions. The concept of similarity-based groups can be extended to categorical data too. While the similarity measure for numeric data is distance based, for categorical data the number of mismatches between data points, or the categories shared among the data points, can be used to find 1-dimensional similarity groups. These similarity groups can be further broken down into dense units whose collisions can help find hidden subspace clusters.

Finally, it would be interesting to see many real-world applications of the SUBSCALE algorithm, especially in microarray data, anomaly detection in cyberspace or financial transactions, and other high-dimensional datasets. The quality and significance of the discovered clusters and outliers can only be verified by domain experts. We have proposed the outlier ranking algorithm in Chapter 6; the outliers with the highest scores are the most significant ones and can help data analysts set their priorities while cleaning the data.

The three main contributions of this thesis are:

1. SUBSCALE: a faster and scalable algorithm to find clusters in the subspaces of high-dimensional data.

2. Variants of SUBSCALE that contain further improvements to its performance; the computations of the SUBSCALE algorithm can also be spread across distributed or parallel environments for speed-up.

3. An algorithm to detect and rank outliers by their outlying behaviour in the subspaces of high-dimensional data.
We believe that with our novel algorithms presented in this thesis, we have been able to further the challenging research field of data analysis for high-dimensional data. We endeavour to continue to work in the future directions discussed in this chapter. Bibliography [1] T. Barrett, S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Tomashevsky, K. A. Marshall, K. H. Phillippy, P. M. Shermpan, M. Holko, A. Yefanov, H. Lee, N. Zhang, C. L. Robertson, N. Serova, S. Davis, and A. Soboleva, “Ncbi geo: archive for functional genomics data setsupdate,” Nucleic Acids Research, vol. 41, no. D1, pp. D991–D995, Jan 2013. [2] P. E. Dewdney, P. J. Hall, R. T. Schilizzi, and T. Lazio, “The square kilometre array,” Proceedings of the IEEE, vol. 97, no. 8, pp. 1482–1496, June 2009. [3] J. Fan, F. Han, and H. Liu, “Challenges of big data analysis,” National Science Review, vol. 1, no. 2, pp. 293–314, June 2014. [4] M. Steinbach, L. Ertöz, and V. Kumar, “The challenges of clustering high dimensional data,” in New Directions in Statistical Physics. Springer Berlin Heidelberg, 2004, pp. 273–309. [5] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. San Francisco, USA: Morgan Kaufmann Publishers, 2011. [6] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 2, pp. 224–227, Apr 1979. 127 128 Bibliography [7] S.-H. Cha, “Comprehensive survey on distance/similarity measures between probability density functions,” International Journal of Mathematical Models and Methods in Applied Sciences, vol. 1, no. 4, pp. 300–307, 2007. [8] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the surprising behavior of distance metrics in high dimensional space,” in International Conference on Database Theory. Springer Berlin Heidelberg, Jan 2001, pp. 420–434. [9] T. Zhang, R. Ramakrishnan, and M. Livny, “Birch: An efficient data clustering method for very large databases,” in Proc. of the ACM SIGMOD international conference on Management of data, vol. 25, no. 2. New York, USA: ACM Press, June 1996, pp. 103–114. [10] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” International Conference on Knowledge Discovery and Data Mining, vol. 96, no. 34, pp. 226–231, Aug 1996. [11] R. E. Bellman, Adaptive control processes: A guided tour. New Jersey, USA: Princeton University Press, 1961. [12] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is nearest neighbor meaningful?” Proceedings of the 7th International Conference on Database Theory, vol. 1540, pp. 217–235, Jan 1999. [13] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park, “Fast algorithms for projected clustering,” SIGMOD Record, vol. 28, no. 2, pp. 61–72, June 1999. [14] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, “Fast discovery of association rules,” Advances in knowledge discovery and data mining, vol. 12, no. 1, pp. 307–328, Feb 1996. 129 Bibliography [15] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic subspace clustering of high dimensional data for data mining applications,” in Proc. of the ACM SIGMOD International conference on Management of Data, vol. 27, no. 2, June 1998, pp. 94–105. [16] L. Parsons, E. Haque, and H. Liu, “Subspace clustering for high dimensional data: a review,” ACM SIGKDD Explorations Newsletter, vol. 6, no. 1, pp. 90–105, June 2004. [17] M. M. 