On K-Means Cluster Preservation using Quantization Schemes
Deepak Turaga (1), Michalis Vlachos (2), Olivier Verscheure (1)
(1) IBM T.J. Watson Research Center, NY, USA
(2) IBM Zürich Research Laboratory, Switzerland

overview – what we want to do…
• Examine under what conditions compression methodologies retain the clustering outcome
• We focus on the K-Means algorithm
[figure: k-Means on the original data and k-Means on the quantized data produce identical clustering results (cluster 1, cluster 2, cluster 3)]

why we want to do that…
• Reduced storage – the quantized data take up less space
• Faster execution – since the data can be represented in a more compact form, the clustering algorithm requires less runtime
• Anonymization / privacy preservation – the original values are not disclosed
• Authentication – encode some message with the quantization
We achieve all of the above and still guarantee the same clustering results.

other cluster preservation techniques
• We do not transform the data into another space
• Space requirements stay the same – no data simplification
• Shape preservation
[Oliveira04] S. R. M. Oliveira and O. R. Zaïane. Privacy Preservation When Sharing Data For Clustering, 2004.
[Parameswaran05] R. Parameswaran and D. Blough. A Robust Data Obfuscation Approach for Privacy Preservation of Clustered Data, 2005.

quantized k-means overview
K-Means algorithm (a minimal sketch follows below):
1. Initialize k clusters (k specified by the user) randomly.
2. Repeat until convergence:
   a. Assign each object to the nearest cluster center.
   b. Re-estimate the cluster centers.
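As a concrete illustration of the steps above, here is a minimal Python/NumPy sketch of plain k-Means (our own illustration, not the authors' code; for brevity it assumes no cluster empties out during the iterations):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-Means: random initialization, then alternate assignment
    and center re-estimation until the centers stop moving."""
    rng = np.random.default_rng(seed)
    # 1. initialize k clusters randomly (pick k distinct data points)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2a. assign each object to the nearest cluster center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2b. re-estimate the cluster centers (assumes no cluster is empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # converged
            break
        centers = new_centers
    return labels, centers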
k-means example
[figure: k-means iterations on 2-D sample points]

k-means applications/usage
• Fast pre-clustering
• Real-time clustering (e.g. image, video effects) – color/image segmentation

k-means objective function
• Objective: minimize the sum of intra-class variance

$\min \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2$, where $c_j$ is the centroid of cluster $C_j$.

After some algebraic manipulation (summing over clusters and over dimensions/time instances), this equals

$\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i \rVert^2 \;-\; \sum_{j=1}^{k} N_j \lVert c_j \rVert^2$

i.e., the objective depends only on the 2nd moment and the 1st moment of the objects in each cluster.

So we can preserve the k-Means outcome if:
• we maintain the cluster assignment
• we preserve the 1st and 2nd moments of the cluster objects

moment preserving quantization
• 1st moment: average
• 2nd (central) moment: variance
• 3rd moment: skewness
• 4th moment: kurtosis
To preserve the first and second moments we use the following 1-bit quantizer. For the $N$ values of one dimension of one cluster, with mean $\mu$ and standard deviation $\sigma$, let $N_b$ be the number of values below the mean and $N_a = N - N_b$ the number above it. Everything below the mean is 'snapped' to $\mu - \sigma\sqrt{N_a/N_b}$ and everything above the mean is 'snapped' to $\mu + \sigma\sqrt{N_b/N_a}$; these two levels preserve both the average and the variance.

original:  -2.4240 -0.2238 0.0581 -0.4246 -0.2029 -1.5131 -1.1264 -0.8150 0.3666 -0.5861 1.5374 0.1401 -1.8628 -0.4542 -0.6521 0.1033 -0.2206 -0.2790 -0.7337 -0.0645  (average = -0.4689)
quantized: -1.4795 0.2049 0.2049 0.2049 0.2049 -1.4795 -1.4795 -1.4795 0.2049 -1.4795 0.2049 0.2049 -1.4795 0.2049 -1.4795 0.2049 0.2049 0.2049 -1.4795 0.2049  (average = -0.4689)

These are the points for one dimension (or time instance) $d$ and for one cluster of objects. The process is repeated for all dimensions and for all clusters: we have one quantizer per class.

our quantization
• One quantizer per class
• The quantized data are binary
• The fact that we have one quantizer per class means we need to run k-Means once before we quantize
• This is not a shortcoming of the technique: we need to know the cluster boundaries to know how much we can simplify the data

why quantization works?
• Why does the clustering remain the same before and after quantization?
– The centers do not change (the averages remain the same)
– The cluster assignment does not change, because the clusters 'shrink' due to quantization

will it always work?
• The results will be the same for datasets with well-formed clusters
• A discrepancy in the results means that the clusters were not that dense

recap
• Use moment-preserving quantization to preserve the objective function
• Due to cluster shrinkage, the cluster assignments will not change
• Identical results for optimal k-Means
• One quantizer per class
• 1-bit quantizer per dimension

example: shape preservation
[figure: shape preservation examples (cf. [Bagnall06])]
[Bagnall06] A. J. Bagnall, C. A. Ratanamahatana, E. J. Keogh, S. Lonardi, and G. J. Janacek. A Bit Level Representation for Time Series Data Mining with Shape Based Similarity. Data Min. Knowl. Discov. 13(1), pages 11–40, 2006.
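As a concrete illustration of the moment-preserving quantizer above, here is a minimal Python/NumPy sketch (our own illustration; the function name and the convention that ties at the mean go to the high level are assumptions, not from the slides). Run on the example values above, it reproduces the two levels -1.4795 and 0.2049 and leaves the average and variance unchanged:

import numpy as np

def moment_preserving_quantize(x):
    """1-bit moment-preserving quantization of one dimension of one cluster:
    the two output levels are chosen so that the quantized values keep the
    same mean (1st moment) and variance (2nd central moment) as the input."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()            # 1st moment, sqrt of 2nd central moment
    above = x >= mu                          # ties at the mean go to the high level
    n_a, n_b = above.sum(), (~above).sum()   # counts above / below the mean
    low  = mu - sigma * np.sqrt(n_a / n_b)   # 'snap' level for values below the mean
    high = mu + sigma * np.sqrt(n_b / n_a)   # 'snap' level for values above the mean
    return np.where(above, high, low)

x = np.array([-2.4240, -0.2238, 0.0581, -0.4246, -0.2029, -1.5131, -1.1264,
              -0.8150, 0.3666, -0.5861, 1.5374, 0.1401, -1.8628, -0.4542,
              -0.6521, 0.1033, -0.2206, -0.2790, -0.7337, -0.0645])
q = moment_preserving_quantize(x)
print(np.unique(q))          # two levels: approx. [-1.4795, 0.2049]
print(x.mean(), q.mean())    # averages match (-0.4689)
print(x.var(), q.var())      # variances match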
example: cluster preservation
• 3 years of Nasdaq stock-ticker data
• We cluster into k = 8 clusters
• Confusion matrix: 3% mislabeled data with the moment-preserving quantization; with binary clipping: 80% mislabeled
• Cluster centers after the moment-preserving quantization: the quantization levels indicate the cluster spread

example: label preservation
• 2 datasets: contours of fish, and contours of leaves (Acer platanoides, Tilia, Salix fragilis, Quercus robur)
• Clustering and then k-NN voting
• For rotation invariance we use rotation-invariant features
[figure: contour datasets and their rotation-invariant feature representations]
• Very low mislabeling error for MPQ; high error rate for binary clipping

other nice characteristics
• Low sensitivity to initial centers – the mismatch when starting from different centers is around 7%
• Neighborhood preservation – even though we do not optimize for it directly, results are good because we preserve the 'shape' of the object
• Size reduction by a factor of 3 when using the quantized scheme; the compression reduces for increasing k

summary
• A 1-bit quantizer per dimension is sufficient to preserve k-Means 'as well as possible'
• Theoretically the results will be identical (under conditions)
• Good 'shape' preservation
Future work:
• Multi-bit quantization
• Multi-dimension quantization
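To tie the pieces together, here is a hypothetical end-to-end check in Python (reusing the kmeans and moment_preserving_quantize sketches above; the synthetic data and the agreement measure are our own illustration, not from the slides): cluster the original data, quantize each cluster independently per dimension, re-cluster the quantized data, and verify that the assignments match up to a relabeling of cluster ids.

import numpy as np

rng = np.random.default_rng(1)
# Three well-separated 2-D clusters -- "well-formed", as the slides require.
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(100, 2))
               for c in ([0, 0], [3, 3], [0, 3])])

labels, _ = kmeans(X, k=3)        # step 1: run k-Means once on the original data

Q = X.copy()                      # step 2: one quantizer per class,
for j in range(3):                # applied independently to each dimension
    for d in range(X.shape[1]):
        Q[labels == j, d] = moment_preserving_quantize(X[labels == j, d])

labels_q, _ = kmeans(Q, k=3)      # step 3: re-run k-Means on the quantized data

# Compare the two assignments up to a permutation of cluster ids: within each
# original cluster, count how many points share the majority quantized label.
agree = np.mean([(labels_q[labels == j] ==
                  np.bincount(labels_q[labels == j]).argmax()).mean()
                 for j in range(3)])
print(f"assignment agreement: {agree:.1%}")   # ~100% for dense clusters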