Co-clustering
Liang Zheng

Clustering
• Clustering along "one dimension": grouping together "similar" objects
• Hard clustering: each object belongs to a single cluster
• Soft clustering: each object is probabilistically assigned to clusters

Co-clustering
• Given a multi-dimensional data matrix, co-clustering refers to simultaneous clustering along multiple dimensions.
[Figure: Doc-Term Co-occurrence Matrix]

Co-occurrence Matrices: Characteristics
• Data sparseness
• High dimension
• Noise

Related Methods
• Information-Theoretic Co-Clustering (ITCC)
• Graph-partitioning based Co-Clustering (GPCC)
1. Dhillon I S, Mallela S, Modha D S. Information-theoretic co-clustering[C]//Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003: 89-98.
2. Dhillon I S. Co-clustering documents and words using bipartite spectral graph partitioning[C]//Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001: 269-274.
Inderjit S. Dhillon, Center for Big Data Analytics, Department of Computer Science, University of Texas at Austin

Information-Theoretic Co-Clustering (ITCC)
• View the (scaled) co-occurrence matrix as a joint probability distribution between the row and column random variables X and Y:
  $p(x, y) = \dfrac{\#\,\text{co-occurrence}(x, y)}{\sum_{x, y} \#\,\text{co-occurrence}(x, y)}$
• We seek a hard clustering of both dimensions, $X \to \hat{X}$ and $Y \to \hat{Y}$, such that the loss in mutual information, $I(X; Y) - I(\hat{X}; \hat{Y})$, is minimized given a fixed number of row and column clusters.
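As a rough illustration of the quantity ITCC tries to minimize, the sketch below scales a small count matrix into a joint distribution, aggregates it under a given hard co-clustering, and reports the loss in mutual information. It only evaluates the objective for fixed cluster labels; it is not the ITCC optimization itself. The function names and the toy matrix are assumptions made for this example, not taken from the slides or the paper.

```python
import math

def joint_distribution(counts):
    """Scale a co-occurrence count matrix so that its entries sum to 1."""
    total = sum(sum(row) for row in counts)
    return [[c / total for c in row] for row in counts]

def mutual_information(p):
    """I(X; Y) in bits for a joint distribution given as a list of rows."""
    px = [sum(row) for row in p]            # row marginals
    py = [sum(col) for col in zip(*p)]      # column marginals
    mi = 0.0
    for i, row in enumerate(p):
        for j, pxy in enumerate(row):
            if pxy > 0:                     # zero cells contribute nothing
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

def cluster_joint(p, row_labels, col_labels, k, l):
    """Aggregate p(x, y) into p(x_hat, y_hat) under hard cluster labels."""
    q = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(p):
        for j, pxy in enumerate(row):
            q[row_labels[i]][col_labels[j]] += pxy
    return q

def mi_loss(counts, row_labels, col_labels, k, l):
    """Loss in mutual information, I(X; Y) - I(X_hat; Y_hat)."""
    p = joint_distribution(counts)
    q = cluster_joint(p, row_labels, col_labels, k, l)
    return mutual_information(p) - mutual_information(q)

# Hypothetical toy matrix with two clean blocks.
counts = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
good = mi_loss(counts, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)  # labels follow the blocks
bad = mi_loss(counts, [0, 1, 0, 1], [0, 1, 0, 1], 2, 2)   # labels cut across the blocks
print(good, bad)
```

On this toy input, the co-clustering that follows the block structure loses no mutual information, while the one that cuts across the blocks loses all of it.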
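The Core of the ITCC Algorithm
• First, choose the statistics of the original data matrix that must still be preserved after the matrix is reconstructed; that is, define the invariant statistics.
• Second, choose a suitable distance measure to quantify the information loss between the original matrix and the compressed matrix obtained after co-clustering; that is, define the objective function.
• ITCC takes the row and column marginal distributions of the matrix as the invariant statistics, and relative entropy (KL divergence) as the measure of the difference between the matrices before and after clustering.

Analysis of the ITCC Algorithm
• The problem is NP-hard; the final solution found by ITCC is a local optimum.
• The time complexity is $O(t \cdot n \cdot (k + l))$, where n is the number of nonzero entries in the matrix, k is the number of row clusters, l is the number of column clusters, and t is the number of iterations.

Graph-partitioning based Co-Clustering
• Given disjoint document clusters $D_1, \ldots, D_k$, the corresponding word clusters $W_1, \ldots, W_k$ may be determined as follows.
• A given word $w_i$ belongs to the word cluster $W_m$ if its association with the document cluster $D_m$ is greater than its association with any other document cluster.
• A natural measure of the association of a word with a document cluster is the sum of the edge weights to all documents in the cluster.
• Clearly, the "best" word and document clustering would correspond to a partitioning of the graph such that the crossing edges between partitions have minimum weight.

The GPCC Algorithm
• Adjacency matrix A → Laplacian matrix $L = D - A$, where D is the degree matrix.
• This problem can be tackled with an SVD-related algorithm.
• Cost: $O(|E| + h \log h)$, where $h = m + n$.

Our Task
[Figure: the link set ($l_1, \ldots, l_m$), the entity set ($e_1, \ldots, e_q$), and the type set ($t_1, \ldots, t_n$), connected through links and types; the diagram is labeled with LE, TE, and LT.]
• The problem can be described as Link-Entity-Type co-clustering: given the link set L, the entity set E, the type set T, an integer k, and an association-strength function Relevance, co-clustering produces a set C with $C_x = (L_x, T_x)$ for $1 \le x \le k$, such that
  $\sum_{l_i \in L_x,\; t_j \in T_y} \mathrm{Relevance}(l_i, t_j)$
  is minimized, where $x, y = 1, \ldots, k$ and $x \ne y$.

Definition
• $l_i$: $\{t_x\}$, the types that link $l_i$ is associated with
• $t_j$: $\{l_y\}$, the links that type $t_j$ is associated with
• $\mathrm{Relevance}(l_i, t_j) = \alpha \sum_{x=1}^{|l_i|} \mathrm{Relevance}(t_x, t_j) + (1 - \alpha) \sum_{y=1}^{|t_j|} \mathrm{Relevance}(l_i, l_y)$, where $\alpha$ weights the two terms
• $\mathrm{Relevance}(t_x, t_j) = \dfrac{2 \cdot \mathrm{depth}(\mathrm{LCS})}{\mathrm{depth}(t_x) + \mathrm{depth}(t_j)}$
• $\mathrm{Relevance}(l_i, l_y) = \dfrac{2 \cdot \mathrm{depth}(\mathrm{LCS})}{\mathrm{depth}(l_i) + \mathrm{depth}(l_y)}$

Q&A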
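The depth-based Relevance formulas on the Definition slide can be sketched as follows. This is a minimal illustration under several assumptions that the slides do not state: LCS is read as the least common subsumer (the deepest common ancestor) in a tree-shaped hierarchy, the root is counted as depth 1, the weight is called `alpha`, and both hierarchies and all names in them are hypothetical.

```python
def ancestors(node, parent):
    """Return the node followed by its ancestors, up to the root."""
    chain = [node]
    while parent.get(node) is not None:
        node = parent[node]
        chain.append(node)
    return chain

def depth(node, parent):
    """Depth of a node, counting the root as depth 1 (an assumption)."""
    return len(ancestors(node, parent))

def lcs(a, b, parent):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):   # walks upward, so the first hit is the deepest
        if node in seen:
            return node
    raise ValueError("nodes do not share a root")

def relevance(a, b, parent):
    """2 * depth(LCS) / (depth(a) + depth(b)) within one hierarchy."""
    return 2 * depth(lcs(a, b, parent), parent) / (depth(a, parent) + depth(b, parent))

def link_type_relevance(link, type_, types_of_link, links_of_type,
                        type_parent, link_parent, alpha=0.5):
    """Weighted sum of type-to-type and link-to-link relevance."""
    type_part = sum(relevance(t, type_, type_parent) for t in types_of_link)
    link_part = sum(relevance(link, l, link_parent) for l in links_of_type)
    return alpha * type_part + (1 - alpha) * link_part

# Hypothetical hierarchies, given as child -> parent maps (None marks the root).
type_parent = {"Thing": None, "Agent": "Thing", "Person": "Agent",
               "Organisation": "Agent", "Place": "Thing"}
link_parent = {"relatedTo": None, "worksFor": "relatedTo", "memberOf": "relatedTo"}

same_branch = relevance("Person", "Organisation", type_parent)
other_branch = relevance("Person", "Place", type_parent)
combined = link_type_relevance(
    "worksFor", "Person",
    types_of_link=["Person", "Organisation"],   # types associated with the link
    links_of_type=["worksFor", "memberOf"],     # links associated with the type
    type_parent=type_parent, link_parent=link_parent, alpha=0.5)
print(same_branch, other_branch, combined)
```

Two types that share a deep ancestor score higher than two types that only meet at the root, which is the behavior the depth ratio is meant to capture.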