δ-Clusters: Capturing Subspace Correlation in a Large Data Set
Authors: Yang Jiong, Wei Wang, etc. (ICDE02)
Presenter: Xuehua Shen [email protected]
May 22, 2017 Data Mining: Concepts and Techniques

Presentation Layout
  Overview of Clustering
  Related Work of δ-Clusters
  δ-Clusters Model
  FLOC algorithm

Clustering
  Clustering: the process of grouping a set of objects into classes of similar objects
  Similar to one another within the same cluster
  Dissimilar to the objects in other clusters

Major Clustering Methods
  Partition algorithm
  Hierarchy algorithm
  Density-based
  Grid-based
  Model-based

Similarity
  Clustering: the process of grouping a set of objects into classes of similar objects
  But how to define similarity?

Similarity cont.
  Traditional clustering model: based on distance functions
  Some popular ones include the Minkowski distance:
    $d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$
  where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer
  But strong correlations may still exist among a set of objects even if they are far apart from each other as measured by the distance function

Similarity cont.
  δ-Clusters model: objects are similar when they exhibit a coherent pattern on a subset of dimensions
  Can cluster objects which show a shifting pattern or a scaling pattern

Similarity cont.
  Example of Coherent Pattern:
  [Figure: Shifting Pattern; Scaling Pattern]

Subspace Clustering
  From high-dimensional clustering (problematic) to subspace clustering
  Not restricted to a fixed ordering of columns, in contrast with patterns in time-series data
  Challenge: curse of dimensionality!

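The point made on the similarity slides, that objects can be far apart under a distance function and still be strongly correlated, can be illustrated with a small sketch. The vectors below are made up for illustration, and `is_shifted` is a hypothetical helper, not something defined in the presentation.

```python
def minkowski(x, y, q=2):
    """Minkowski distance of order q between two equal-length vectors."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


def is_shifted(x, y, tol=1e-9):
    """True if y equals x plus one constant offset on every dimension."""
    offsets = [b - a for a, b in zip(x, y)]
    return max(offsets) - min(offsets) <= tol


# Illustrative (made-up) objects on four dimensions.
a = [1.0, 5.0, 2.0, 8.0]
b = [101.0, 105.0, 102.0, 108.0]  # a shifted by a constant 100
c = [2.0, 4.0, 3.0, 7.0]          # near a, but not a pure shift of a

dist_ab = minkowski(a, b)  # large: sqrt(4 * 100**2), i.e. about 200
dist_ac = minkowski(a, c)  # small: sqrt(4 * 1**2), i.e. about 2

print(dist_ab, is_shifted(a, b))
print(dist_ac, is_shifted(a, c))
```

A distance-based method would group `a` with `c`, while a coherent-pattern model would group `a` with `b`.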
Subspace Clustering cont.
  Example of subspace clustering

           CH11   CH1B   CH1D   CH2I   CH2B
  CTFC3    4392    284   4108    280    228
  VPS8      401    281    120    275    298
  EFB1      318    280     37    277    215
  SSA1      401    292    109    580    238
  FUN14    2857    285   2576    271    226
  SP07      228    290     48    285    224
  MDM10     538    272    266    277    236
  CYS3      322    288     41    278    219

  Selected objects and attributes:

           CH11   CH1D   CH2B
  VPS8      401    120    298
  EFB1      318     37    215
  CYS3      322     41    219

Applications
  Microarray Data Analysis in Biology
  E-Commerce

Microarray Data Analysis
  Matrix (dense)
  Rows: genes
  Columns: various samples (experiment conditions or tissues)
  Values in matrix: expression level, i.e. the relative abundance of the mRNA of a gene under a specific condition

Microarray Data Analysis cont.
  From scaling pattern to shifting pattern:
    $d_{ij} = \log\left(\frac{\text{RedIntensity}}{\text{GreenIntensity}}\right)$
  Red: gene of interest; Green: control gene
  Investigations show that several genes often contribute to a disease, which motivates researchers to identify a subset of genes whose expression levels rise and fall coherently under a subset of conditions

E-Commerce
  Example: rating of movies (1: lowest rating, 10: highest rating)

             Movie 1   Movie 2   Movie 3   Movie 4
  Viewer 1      1         2         3         6
  Viewer 2      2         3         4         7
  Viewer 3      4         5         6         9

  Shifting pattern
  For a new movie, if the 1st viewer rates it 7 and the 3rd viewer rates it 9, the 2nd viewer will probably like the movie too

Presentation Layout
  Overview of Clustering
  Related Work of δ-Clusters
  δ-Clusters Model
  FLOC algorithm

Related Work
  CLIQUE, ORCLUS, PROCLUS (subspace clustering)
  They can capture neither the shifting pattern nor the scaling pattern
  Bicluster model: proposed as a measure of the coherence of genes and conditions in a submatrix of a DNA array

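Returning to the subspace clustering example and the log-ratio transform from the slides above, the following sketch checks two things: that the three selected genes differ by a constant offset on the three selected conditions, and that taking logarithms turns a scaling pattern into a shifting pattern. The gene values are copied from the example table; the `base` and `scaled` vectors and the helper name `offsets` are assumptions made for illustration.

```python
import math


def offsets(row_a, row_b):
    """Element-wise difference row_a - row_b."""
    return [a - b for a, b in zip(row_a, row_b)]


# Values from the example sub-table (conditions CH11, CH1D, CH2B).
vps8 = [401, 120, 298]
efb1 = [318, 37, 215]
cys3 = [322, 41, 219]

vps8_minus_efb1 = offsets(vps8, efb1)  # [83, 83, 83]
vps8_minus_cys3 = offsets(vps8, cys3)  # [79, 79, 79]

# Made-up scaling pattern: `scaled` is `base` multiplied by 3 everywhere.
base = [2.0, 8.0, 4.0]
scaled = [3 * v for v in base]

# After the log transform the ratio 3 becomes a constant offset log(3).
log_offsets = offsets([math.log(v) for v in scaled],
                      [math.log(v) for v in base])

print(vps8_minus_efb1, vps8_minus_cys3)
print(log_offsets)
```

Constant offsets across all selected columns are exactly what the shifting pattern means, which is why the three genes form a coherent cluster on that subset of conditions even though their raw values differ.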
Bicluster
  Model: mean squared residue score of a submatrix:
    $H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (d_{ij} - d_{iJ} - d_{Ij} + d_{IJ})^2$
  where
    $d_{iJ} = \frac{1}{|J|} \sum_{j \in J} d_{ij}$, $d_{Ij} = \frac{1}{|I|} \sum_{i \in I} d_{ij}$, $d_{IJ} = \frac{1}{|I||J|} \sum_{i \in I, j \in J} d_{ij}$
  A submatrix $A_{IJ}$ is called a δ-bicluster if $H(I, J) \le \delta$
  Algorithm: a randomized algorithm that gives an approximate answer

Weakness of bicluster
  Missing Values
  Constraints

Presentation Layout
  Overview
  Related Work
  δ-Clusters Model
  FLOC algorithm

Occupancy Threshold
  A parameter α to control the percentage of missing values in a submatrix:
    $\frac{|J'_i|}{|J|} \ge \alpha$
  $|J'_i|$ is the number of specified attributes for object i in the δ-Cluster
  $|J|$ is the number of attributes in the δ-Cluster

Occupancy Threshold cont.
  A similar occupancy threshold holds for each attribute j in the δ-Cluster
  Example: α = 0.6
  [Figure: example matrices with missing entries]

Volume
  The volume of a δ-Cluster (I, J) is the number of specified entries $d_{ij}$ in (I, J)
  Example: volume is 3 × 3 = 9
  [Figure: 3 × 3 example matrix]

Base
  Object base: $d_{iJ} = \frac{\sum_{j \in J'_i} d_{ij}}{|J'_i|}$
  Attribute base: $d_{Ij} = \frac{\sum_{i \in I'_j} d_{ij}}{|I'_j|}$

Base cont.
  δ-Cluster base: $d_{IJ} = \frac{\sum_{i \in I, j \in J} d_{ij}}{v_{IJ}}$
  For a perfect δ-Cluster: $d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$

Residue
  Entry residue: $r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$ if $d_{ij}$ is specified, otherwise $r_{ij} = 0$

Residue cont.
  δ-Cluster residue: $r_{IJ} = \frac{\sum_{i \in I, j \in J} |r_{ij}|}{v_{IJ}}$
  A δ-Cluster is an r-residue δ-Cluster if its residue is equal to or smaller than r

Presentation Layout
  Overview of Clustering
  Related Work of δ-Clusters
  δ-Clusters Model
  FLOC algorithm (Flexible Overlapping Clustering)

Flow Chart
  1. Generate initial clusters
  2. Determine the best action for each row and each column
  3. Perform the best actions sequentially
  4. Improved? Y: go back to step 2; N: stop

Initial Cluster
  Randomly generate k initial clusters
  Different parameters produce clusters of different sizes

Choose Best Actions
  For every object or attribute there are k possible actions, one per cluster
  Choose the best action among the k candidates according to its gain
  Gain is the difference between the original residue of the cluster and its residue assuming the action is performed

Choose Best Actions cont.
  Even if the gain is negative, the action is sometimes still performed in order to reach the global optimum

Do the Actions Sequentially
  Generate the action sequence using one of:
  1) the same order in all iterations
  2) a random order
  3) a weighted random order

Output the Best Clusters
  When the minimum residue no longer improves after some iterations, the algorithm stops and the k best clusters are output

End
  Thank you!
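Supplementary sketch: the base, residue, and gain definitions from the model and FLOC slides can be put together in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: missing entries are encoded as `None`, the cluster residue is taken as the mean absolute entry residue over specified entries, only row actions are shown, and the function names `cluster_residue` and `gain` are hypothetical. The first three rows are the movie ratings from the e-commerce slide; the fourth row is made up.

```python
from collections import defaultdict


def cluster_residue(matrix, rows, cols):
    """Residue of the cluster (rows, cols); None marks a missing entry."""
    by_row, by_col, specified = defaultdict(list), defaultdict(list), []
    for i in rows:
        for j in cols:
            value = matrix[i][j]
            if value is not None:
                by_row[i].append(value)
                by_col[j].append(value)
                specified.append((i, j, value))
    if not specified:
        return 0.0
    volume = len(specified)  # number of specified entries
    cluster_base = sum(v for _, _, v in specified) / volume
    total = 0.0
    for i, j, value in specified:
        object_base = sum(by_row[i]) / len(by_row[i])
        attribute_base = sum(by_col[j]) / len(by_col[j])
        total += abs(value - object_base - attribute_base + cluster_base)
    return total / volume


def gain(matrix, rows, cols, row):
    """Residue reduction from toggling `row` in or out of the cluster."""
    rows = list(rows)
    if row in rows:
        new_rows = [r for r in rows if r != row]
    else:
        new_rows = rows + [row]
    return (cluster_residue(matrix, rows, cols)
            - cluster_residue(matrix, new_rows, cols))


ratings = [
    [1, 2, 3, 6],     # Viewer 1 (from the slide)
    [2, 3, 4, 7],     # Viewer 2 (from the slide)
    [4, 5, 6, 9],     # Viewer 3 (from the slide)
    [9, 1, None, 2],  # made-up viewer with one missing rating
]
all_cols = [0, 1, 2, 3]

print(cluster_residue(ratings, [0, 1, 2], all_cols))   # close to 0
print(gain(ratings, [0, 1, 2, 3], all_cols, row=3))    # positive
```

The three viewers from the slide follow a perfect shifting pattern, so their residue is zero up to floating-point rounding, and removing the incoherent fourth viewer yields a positive gain.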