EECS 800 Research Seminar
Mining Biological Data
Instructor: Luke Huan
Fall, 2006
The University of Kansas

Model-Based Clustering
- What is model-based clustering?
  - Attempts to optimize the fit between the given data and some mathematical model
  - Based on the assumption that the data are generated by a mixture of underlying probability distributions
- Typical methods
  - Statistical approach: EM (Expectation Maximization), AutoClass
  - Machine learning approach: COBWEB, CLASSIT
  - Neural network approach: SOM (Self-Organizing Feature Map)

10/04/2006 Model-based Clustering Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide2

EM: Expectation Maximization
- EM: a popular iterative refinement algorithm
- An extension to k-means
  - Assigns each object to a cluster according to a weight (probability distribution)
  - New means are computed based on weighted measures
- General idea
  - Starts with an initial estimate of the parameter vector
  - Iteratively rescores the patterns against the mixture density produced by the parameter vector
  - The rescored patterns are used to update the parameter estimates
  - Patterns belong to the same cluster if their scores place them in the same component
- The algorithm converges fast but may not reach the global optimum
- AutoClass (Cheeseman and Stutz, 1996)

1D Gaussian Mixture Model
- Given a set of data distributed in a 1D space, how do we cluster the data set?
- General idea: factorize the p.d.f. into a mixture of simple models.
  - Discrete values: Bernoulli distribution
  - Continuous values: Gaussian distribution

The EM (Expectation Maximization) Algorithm
- Initially, randomly assign k cluster centers
- Iteratively refine the clusters based on two steps
  - Expectation step: assign each data point $x_i$ to cluster $C_k$ with the following probability
    [formula not preserved in the source]
  - Maximization step: estimation of model parameters
    $\mu_k = \sum_i x_i \, P(x_i, C_k) \,/\, \sum_i P(x_i, C_k)$

Another Way of K-means?
- Pros
  - AutoClass can adapt to different (convex) shapes of clusters; k-means assumes spheres
  - Solid statistical foundation
- Cons: computationally expensive

Model-Based Subspace Clustering
- Microarray
- Bi-clustering
- δ-clustering
- p-clustering
- OP-clustering

Microarray Dataset
[Figure]

Gene Expression Matrix
$$X = \begin{pmatrix} x_{11} & \cdots & x_{1j} & \cdots & x_{1m} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{ij} & \cdots & x_{im} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nj} & \cdots & x_{nm} \end{pmatrix}$$
- Rows: genes
- Columns: conditions (time points, cancer tissues)

Data Mining: Clustering
- K-means clustering minimizes
  $\sum_{t=1}^{k} \sum_{i \in c_t} dist(x_i, c_t)^2$
  where
  $dist(x_i, c_t) = \sqrt{\sum_{j=1}^{m} (x_{ij} - c_{tj})^2}$

Clustering by Pattern Similarity (p-Clustering)
- The micro-array "raw" data shows 3 genes and their values in a multi-dimensional space
  - Parallel coordinates plots
  - Difficult to find their patterns
- "Non-traditional" clustering

Clusters Are Clear After Projection
[Figure]

Motivation
- DNA microarray analysis

         CH1I  CH1B  CH1D  CH2I  CH2B
  CTFC3  4392   284  4108   280   228
  VPS8    401   281   120   275   298
  EFB1    318   280    37   277   215
  SSA1    401   292   109   580   238
  FUN14  2857   285  2576   271   226
  SP07    228   290    48   285   224
  MDM10   538   272   266   277   236
  CYS3    322   288    41   278   219
  DEP1    312   272    40   273   232
  NTG1    329   296    33   274   228

Motivation
[Figure: strength (y-axis, 0 to 450) versus condition (x-axis: CH1I, CH1D, CH2B)]

Motivation
- Strong coherence is exhibited by the selected objects on the selected attributes.
  - They are not necessarily close to each other but rather bear a constant shift.
  - Object/attribute bias
- bi-cluster

Challenges
- The set of objects and the set of attributes are usually unknown.
- Different objects/attributes may possess different biases, and such biases
  - may be local to the set of selected objects/attributes
  - are usually unknown in advance
- May have many unspecified entries

Previous Work
- Subspace clustering
  - Identifying a set of objects and a set of attributes such that the set of objects are physically close to each other on the subspace formed by the set of attributes.
- Collaborative filtering: Pearson R
  - Only considers global offset of each object/attribute.
  $$\frac{\sum (o_1 - \bar{o}_1)(o_2 - \bar{o}_2)}{\sqrt{\sum (o_1 - \bar{o}_1)^2 \times \sum (o_2 - \bar{o}_2)^2}}$$

bi-cluster Terms
- Consists of a (sub)set of objects and a (sub)set of attributes
  - Corresponds to a submatrix
- Occupancy threshold
  - Each object/attribute has to be filled by a certain percentage.
- Volume: number of specified entries in the submatrix
- Base: average value of each object/attribute (in the bi-cluster)
Biclustering of Expression Data, Cheng & Church, ISMB’00

bi-cluster
(only the entries selected for the bi-cluster are shown)

             CH1I  CH1B  CH1D  CH2I  CH2B  Obj base
  CTFC3
  VPS8        401         120         298     273
  EFB1        318          37         215     190
  SSA1
  FUN14
  SP07
  MDM10
  CYS3        322          41         219     194
  DEP1
  NTG1
  Attr base   347          66         244     219

Expression data: 40 genes, 17 conditions (columns 0 to 16)

  Gene       0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
  YBL069W  139  69   0  69 139 139 139 139  69   0   0  69 110   0  69   0   0
  YBL097W    0  69  69  69 110 110 110 110  69   0  69  69 110 110  69   0  69
  YBR064W  139 110   0  69  69 110 139 139 139   0  69  69 139  69  69   0   0
  YBR065C  139 110   0  69 110 110 110 139 110   0  69 110 139  69  69   0  69
  YBR114W  208 179 110  69 110 110 110 161 161   0  69  69 110   0  69   0  69
  YCL013W    0   0   0  69  69 139 161 179 139   0  69   0 110   0  69   0  69
  YDR149C    0   0   0   0 110 110 110  69 110   0   0   0  69   0  69   0  69
  YDR461W  179 161  69  69  69 110  69 110 110   0   0  69   0   0  69   0  69
  YDR526C   69 110  69 110 110 161 110  69 139  69  69 110 110 139 110  69 110
  YHR061C   69   0  69  69 110 139 110   0   0   0  69  69 110  69  69   0   0
  YIL092W  139 161 110 110 139 179 139 110 139  69  69  69 110 110 110  69  69
  YIR043C  179 179 161 139 161 195 161 161 161 110 161 161 139 139 161 110 110
  YJL010C  179 240 161 195 195 256 220 208 240 139 195 195 195 161 195 161 110
  YJL023C  161 161  69 110 139 161 139 110 161  69 110 139  69  69 110  69  69
  YJL033W  208 283 240 248 264 304 283 283 283 195 220 240 240 240 248 195 208
  YJL076W  161 195 110 139 195 248 179 161 220 110 179 195 161 179 208 110 110
  YJR162C  139 161 139 161 139 179 161 139  69  69 139  69  69 179 179 110  69
  YKL068W  304 326 304 322 326 350 340 376 318 248 314 283 314 318 326 264 264
  YKL134C   69  69   0  69 110 110  69   0  69   0  69  69 139  69  69   0   0
  YLR219W  283 208 220 277 289 326 289 289 248 220 271 240 271 294 277 230 208
  YLR380W  337 383 383 413 414 403 381 393 343 350 369 358 347 358 356 314 289
  YLR381W  161 161 220 195 161 195 161 110 110 110 195 179 179  69 139 110 110
  YLR382C  208 195 220 161 139 161 161 110 139 110 195 195 195  69 161 139 139
  YLR383W  248 230 330 300 277 240 240 179 195 220 277 289 240 240 220 161 161
  YLR384C  264 300 289 264 277 277 289 277 300 248 283 271 294 256 264 271 283
  YLR386W  230 240 289 264 240 256 220 208 220 248 271 256 256 240 220 179 208
  YLR388W  439 442 464 456 451 422 417 403 432 510 438 442 450 462 419 476 476
  YLR392C  256 230 208 240 230 248 240 283 248 220 230 230 220 240 248 220 240
  YLR395C  374 322 322 300 330 356 361 333 369 376 369 374 369 343 361 393 399
  YLR400W  139 195 161 139 161 139 161 139 179 110 110 139 139 139 110 161 161
  YLR401C  230 277 256 248 264 271 248 240 256 220 230 230 256 208 208 240 230
  YLR406C  494 470 498 488 477 460 466 484 449 532 485 473 464 487 477 492 484
  YLR408C  326 248 240 289 300 294 289 264 277 248 283 283 277 283 277 271 283
  YLR411W  179 139 110  69  69 110  69  69  69  69  69  69  69  69 110  69  69
  YLR413W  326 411 397 383 371 347 314 277 330 264 289 283 304 264 264 340 343
  YLR450W  161 220 220 220 208 208 161 161 208 179 195 179 179 161 139 161 139
  YLR451W  220 271 248 230 240 248 240 179 248 208 208 220 230 220 179 230 230
  YLR452C  220 271 230 208 161 195 161 161 195 161 208 195 220 161 179 195 220
  YLR453C  179 195 110 161 139 179 161 179 161  69 110 139 139 139 161 139 161
  YLR454W  283 318 195 271 195 304 289 283 289 304 330 264 256 271 309 277 256

Motivation
[Figure: expression level (y-axis, 0 to 600) versus condition (x-axis, 0 to 16)]

Expression data: 40 genes, 17 conditions
[The same 40-gene by 17-condition table is shown again on this slide.]

Motivation
[Figure: expression level (y-axis, 0 to 600) versus condition (x-axis: 3, 5, 9, 14, 15) for genes YBL069W through YLR219W, the first 20 genes of the table]
- Co-regulated genes

bi-cluster
- Perfect δ-cluster
  $d_{ij} - d_{iJ} = d_{Ij} - d_{IJ}$, equivalently $d_{ij} - d_{Ij} = d_{iJ} - d_{IJ}$, i.e. $d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$
  where $d_{iJ}$ is the base of object $i$, $d_{Ij}$ the base of attribute $j$, and $d_{IJ}$ the base of the whole bi-cluster
- Imperfect δ-cluster
  - Residue:
    $$r_{ij} = \begin{cases} d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}, & d_{ij} \text{ is specified} \\ 0, & d_{ij} \text{ is unspecified} \end{cases}$$
[Figure: the four quantities $d_{ij}$, $d_{iJ}$, $d_{Ij}$, $d_{IJ}$]

bi-cluster
- The smaller the average residue, the stronger the coherence.
- Objective: identify δ-clusters with residue smaller than a given threshold

Cheng-Church Algorithm
- Find one bi-cluster.
- Replace the data in the first bi-cluster with random data.
- Find the second bi-cluster, and go on.
- The quality of the bi-clusters degrades (smaller volume, higher residue) due to the insertion of random data.

The FLOC algorithm
1. Generate initial clusters
2. Determine the best action for each row and each column
3. Perform the best action of each row and column sequentially
4. Improved? Y: repeat from step 2; N: stop
Yang et al., delta-Clusters: Capturing Subspace Correlation in a Large Data Set, ICDE’02

The FLOC algorithm
- Action: the change of membership of a row (or column) with respect to a cluster

           column
           1  2  3  4
  row 1    3  4  2  2
  row 2    1  3  2  3
  row 3    4  2  0  4

  N = 3 rows, M = 4 columns
- M+N actions are performed at each iteration

The FLOC algorithm
- Gain of an action: the residue reduction incurred by performing the action
- Order of action:
  - Fixed order
  - Random order
  - Weighted random order
- Complexity: O((M+N)MNkp)

The FLOC algorithm
- Additional features
  - Maximum allowed overlap among clusters
  - Minimum coverage of clusters
  - Minimum volume of each cluster
- Can be enforced by "temporarily blocking" certain actions during the mining process if such an action would violate some constraint.

Performance
- Microarray data: 2884 genes, 17 conditions
- The 100 bi-clusters with the smallest residue were returned.
- Average residue = 10.34
  - The average residue of clusters found via the state-of-the-art method in the computational biology field is 12.54
- The average volume is 25% bigger
- The response time is an order of magnitude faster

Conclusion Remark
- The bi-cluster model is proposed to capture coherent objects in an incomplete data set.
  - base
  - residue
- Many additional features can be accommodated (nearly for free).
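As a rough illustration of the base and residue terms used for bi-clusters above, the sketch below recomputes the object bases, attribute bases and residue for the VPS8/EFB1/CYS3 example on conditions CH1I, CH1D, CH2B. The function name `bicluster_residue`, the use of `None` for unspecified entries, and the choice of the mean absolute residue as the "average residue" are assumptions made for illustration, not taken from the slides. It also assumes every selected row and column has at least one specified entry.

```python
from statistics import mean

def bicluster_residue(matrix, rows, cols):
    """Return (average residue, object bases, attribute bases, cluster base).

    Hypothetical helper: None marks an unspecified entry, and the average
    residue is taken as the mean absolute residue over specified entries.
    """
    # Keep only the specified entries of the selected submatrix.
    cells = {(i, j): matrix[i][j]
             for i in rows for j in cols if matrix[i][j] is not None}
    volume = len(cells)  # number of specified entries
    obj_base = {i: mean(v for (r, _), v in cells.items() if r == i) for i in rows}
    attr_base = {j: mean(v for (_, c), v in cells.items() if c == j) for j in cols}
    cluster_base = mean(cells.values())
    # r_ij = d_ij - d_iJ - d_Ij + d_IJ for specified entries.
    residues = [v - obj_base[i] - attr_base[j] + cluster_base
                for (i, j), v in cells.items()]
    avg_residue = sum(abs(r) for r in residues) / volume
    return avg_residue, obj_base, attr_base, cluster_base

# Rows follow the DNA microarray table: CTFC3, VPS8, EFB1, SSA1, FUN14,
# SP07, MDM10, CYS3, DEP1, NTG1; columns: CH1I, CH1B, CH1D, CH2I, CH2B.
data = [
    [4392, 284, 4108, 280, 228],
    [401, 281, 120, 275, 298],
    [318, 280, 37, 277, 215],
    [401, 292, 109, 580, 238],
    [2857, 285, 2576, 271, 226],
    [228, 290, 48, 285, 224],
    [538, 272, 266, 277, 236],
    [322, 288, 41, 278, 219],
    [312, 272, 40, 273, 232],
    [329, 296, 33, 274, 228],
]

# VPS8, EFB1, CYS3 on CH1I, CH1D, CH2B.
avg_residue, obj_base, attr_base, cluster_base = bicluster_residue(
    data, rows=[1, 2, 7], cols=[0, 2, 4])
print(avg_residue, obj_base, attr_base, cluster_base)
```

The bases reproduce the "Obj base" and "Attr base" values of the example table, and every residue is zero, so the three genes form a perfect δ-cluster on these three conditions even though their raw values are far apart.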
p-Clustering: Clustering by Pattern Similarity
- Given objects x, y in O and features a, b in T, the pScore of the 2 by 2 matrix is
  $$pScore\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|$$
- A pair (O, T) is a δ-pCluster if for any 2 by 2 matrix X in (O, T), pScore(X) ≤ δ for some δ > 0
- For scaling patterns, taking the logarithm of $\frac{d_{xa} / d_{ya}}{d_{xb} / d_{yb}}$ leads to the pScore form
H. Wang, et al., Clustering by pattern similarity in large data sets, SIGMOD’02.

Coherent Cluster
- Want to accommodate noise but not outliers
[Figure]

Coherent Cluster
- Coherent cluster
  - Subspace clustering
- Pair-wise disparity
  - For a 2×2 (sub)matrix consisting of objects {x, y} and attributes {a, b}:
    $$D\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$$
    where $(d_{xa} - d_{ya})$ is the mutual bias of attribute a and $(d_{xb} - d_{yb})$ is the mutual bias of attribute b
[Figure: objects x, y, z plotted over attributes a and b, with the values $d_{xa}$, $d_{ya}$, $d_{xb}$, $d_{yb}$ marked]

Coherent Cluster
- A 2×2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ.
- An m×n matrix X is a δ-coherent cluster if every 2×2 submatrix of X is a δ-coherent cluster.
  - A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster.
- Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.

Coherent Cluster
- Challenges:
  - Finding subspace clusters based on distance is itself already a difficult task due to the curse of dimensionality.
  - The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.
  - The actual values of the objects in a coherent cluster may be far apart from each other.
  - Each object or attribute in a coherent cluster may bear some relative bias (unknown in advance), and such bias may be local to the coherent cluster.

Coherent Cluster
1. Compute the maximum coherent attribute sets for each pair of objects
2. Two-way pruning
3. Construct the lexicographical tree
4. Post-order traverse the tree to find maximum coherent clusters

Coherent Cluster
- Observation: given a pair of objects {o1, o2} and a (sub)set of attributes {a1, a2, …, ak}, the 2×k submatrix is a δ-coherent cluster iff the mutual biases $(d_{o_1 a_i} - d_{o_2 a_i})$, one per attribute $a_i$, do not differ from each other by more than δ.
[Figure: o1 and o2 plotted over attributes a1 to a5 (y-axis ticks 1, 3, 5, 7); the mutual biases are 3, 2, 3.5, 2, 2.5, spanning the range [2, 3.5]]
- If δ = 1.5, then {a1, a2, a3, a4, a5} is a coherent attribute set (CAS) of (o1, o2).

Coherent Cluster
- Observation: given a subset of objects {o1, o2, …, ol} and a subset of attributes {a1, a2, …, ak}, the l×k submatrix is a δ-coherent cluster iff {a1, a2, …, ak} is a coherent attribute set for every pair of objects (oi, oj) where 1 ≤ i, j ≤ l.
[Figure: grid of objects o1 to o6 against attributes a1 to a7]

Coherent Cluster
- Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold δ.
[Figure: r1 and r2 plotted over a1 to a5 with mutual biases 3, 2, 3.5, 2, 2.5, and again with the attributes sorted by bias as a2, a4, a5, a1, a3 (biases 2, 2, 2.5, 3, 3.5); δ = 1]
- The maximum coherent attribute sets define the search space for maximum coherent clusters.
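The strategy of finding maximum coherent attribute sets for one pair of objects can be sketched as follows: compute the mutual bias of every attribute, sort the attributes by bias, and slide a window whose bias range stays within δ. This is a minimal sketch, not the authors' implementation. The function name `max_coherent_attribute_sets` and the raw values in `o1` and `o2` are hypothetical, chosen only so that the mutual biases are 3, 2, 3.5, 2, 2.5 as in the example above.

```python
def max_coherent_attribute_sets(row1, row2, delta):
    """Maximal sets of attribute indices whose mutual biases differ by at most delta."""
    # Mutual bias of attribute j is row1[j] - row2[j]; sort attributes by bias.
    biases = sorted((a - b, j) for j, (a, b) in enumerate(zip(row1, row2)))
    n = len(biases)
    result = []
    last_end = -1  # right end of the most recently reported window
    for start in range(n):
        end = start
        # Extend the window to the right while the bias range stays within delta.
        while end + 1 < n and biases[end + 1][0] - biases[start][0] <= delta:
            end += 1
        # Window ends never decrease as start grows, so a window is maximal
        # exactly when it reaches further right than the previous one.
        if end > last_end:
            result.append(sorted(j for _, j in biases[start:end + 1]))
            last_end = end
    return result

# Hypothetical values for two objects over attributes a1..a5 (indices 0..4);
# the resulting mutual biases are 3, 2, 3.5, 2, 2.5.
o1 = [5, 4, 6.5, 3, 4.5]
o2 = [2, 2, 3, 1, 2]

cas_wide = max_coherent_attribute_sets(o1, o2, delta=1.5)
cas_narrow = max_coherent_attribute_sets(o1, o2, delta=1.0)
print(cas_wide)
print(cas_narrow)
```

With δ = 1.5 all five attributes form a single coherent attribute set. With δ = 1 the sorted order a2, a4, a5, a1, a3 yields two maximal sets, {a1, a2, a4, a5} and {a1, a3, a5}, shown as 0-based indices in the output.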
Two Way Pruning

        a0   a1   a2
  o0     1    4    2
  o1     2    5    5
  o2     3    6    5
  o3     4  200    7
  o4   300    7    6

  delta = 1, nc = 3, nr = 3

- MCOS (object sets per attribute pair)
  (a0,a1) → (o0,o1,o2)
  (a0,a2) → (o1,o2,o3)
  (a1,a2) → (o1,o2,o4)
  (a1,a2) → (o0,o2,o4)
- MCAS (maximum coherent attribute sets per object pair)
  (o0,o2) → (a0,a1,a2)
  (o1,o2) → (a0,a1,a2)
[The slide shows each list twice; any marking of pruned entries is not preserved.]

Coherent Cluster
- Strategy: group object pairs by their CAS and, for each group, find the maximum clique(s).
- Implementation: use a lexicographical tree to organize the object pairs and generate all maximum coherent clusters with a single post-order traversal of the tree.
[Figure: axes labeled attributes and objects]

Coherent Cluster: example (assume δ = 1)

        a0   a1   a2   a3
  o0     1    4    2    5
  o1     2    5    5    8
  o2     3    6    5    7
  o3     4   20    7    2
  o4    30    7    6    6

- Coherent attribute sets per object pair
  (o0,o1) : {a0,a1}, {a2,a3}
  (o0,o2) : {a0,a1,a2,a3}
  (o0,o4) : {a1,a2}
  (o1,o2) : {a0,a1,a2}, {a2,a3}
  (o1,o3) : {a0,a2}
  (o1,o4) : {a1,a2}
  (o2,o3) : {a0,a2}
  (o2,o4) : {a1,a2}
- Object pairs grouped by attribute set
  {a0,a1} : (o0,o1) (o1,o2) (o0,o2)
  {a0,a2} : (o1,o3) (o2,o3) (o1,o2) (o0,o2)
  {a1,a2} : (o0,o4) (o1,o4) (o2,o4) (o1,o2) (o0,o2)
  {a2,a3} : (o0,o1) (o1,o2) (o0,o2)
  {a0,a1,a2} : (o1,o2) (o0,o2)
  {a0,a1,a2,a3} : (o0,o2)
[Figure: lexicographical tree over a0 to a3, with each object pair attached at the node of its coherent attribute set]

Coherent Cluster
- High expressive power
  - The coherent cluster can capture many interesting and meaningful patterns overlooked by previous clustering methods.
- Efficient and highly scalable
- Wide applications
  - Gene expression analysis
  - Collaborative filtering
[Figure: average response time in seconds (y-axis, 0 to 12000) versus number of conditions (x-axis: 10, 20, 50, 100, 200, 500) for subspace cluster and coherent cluster]

Remark
- Compared to bi-cluster
  - Can well separate noise and outliers
  - No random data insertion and replacement
  - Produces the optimal solution

Definition of OP-Cluster
- Let I be a subset of genes in the database. Let J be a subset of conditions. We say <I, J> forms an Order Preserving Cluster (OP-Cluster) if one of the following relationships exists for any pair of conditions $j_1, j_2 \in J$, $j_1 \neq j_2$:
  (1) $\forall i \in I,\ D_{ij_1} < D_{ij_2}$
  (2) $\forall i \in I,\ D_{ij_1} > D_{ij_2}$
  (3) $\forall i \in I,\ D_{ij_1} \approx D_{ij_2}$
- Two values count as approximately equal when $\max_{j_1, j_2 \in J} |D_{ij_1} - D_{ij_2}|$ is small relative to $\min(|D_{ij_1}|, |D_{ij_2}|)$ [the exact relation and threshold symbol are not preserved in the source].
[Figure: expression levels (y-axis) over attributes A1, A2, A3, A4]

Problem Statement
- Given a gene expression matrix, our goal is to find all the statistically significant OP-Clusters. The significance is ensured by the minimal size thresholds nc and nr.

Conversion to Sequence Mining Problem
  (1) $D_{ij_1} < D_{ij_2} \Rightarrow j_1 \prec j_2$
  (2) $D_{ij_1} > D_{ij_2} \Rightarrow j_1 \succ j_2$
  (3) $D_{ij_1} \approx D_{ij_2} \Rightarrow CanonicalOrder(j_1, j_2)$
[Figure: expression levels (y-axis) over A1, A2, A3, A4]
- Sequence: A1 A4 A3 A2

Mining OP-Clusters: A naïve approach
- A naïve approach
  - Enumerate all possible subsequences in a prefix tree.
  - For each subsequence, collect all genes that contain the subsequence.
- Challenge: the total number of distinct subsequences is
  $\sum_{1 \le i \le N} \binom{m}{i}\, i!$
[Figure: a complete prefix tree with 4 items {a, b, c, d}]

Mining OP-Clusters: Prefix Tree
- Goal: build a compact prefix tree that includes only the subsequences occurring in the original database.
- Strategies:
  1. Depth-first traversal
  2. Suffix concatenation: visit subsequences that only exist in the input sequences.
  3. Apriori property: visit subsequences that are sufficiently supported in order to derive longer subsequences.
- Example input sequences
  g1: adbc
  g2: abdc
  g3: badc
[Figure: prefix tree built from g1, g2, g3; each node is labeled with an item and the ids of the genes supporting it, e.g. a:1,2,3]

References
- J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002.
- H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern similarity in large data sets, to appear in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
- S. Yoon, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data, Bioinformatics and Bioengineering, 2004.
- J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM’03.