Download Information-Theoretic Co

C0-clustering Liang Zheng Clustering • Clustering along “One-Dimension” – Grouping together of “similar” objects • Hard Clustering -- Each object belongs to a single cluster • Soft Clustering -- Each object is probabilistically assigned to clusters C0-clustering • Given a multi-dimensional data matrix, co-clustering refers to simultaneous clustering along multiple dimensions. Term 1 1  1  0 0  0 Doc  0  0 0  0  0 0 0 0 1 1 1 1 1 0 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 2 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0  0 0  0 0  0 1  0  1 1 Doc-Term Co-occurrence Matrix Co-occurrence Matrices Characteristics • Data sparseness • High dimension • Noise Related Methods 1. • Information-Theoretic Co-Clustering (ITCC) • Graph-partitioning based Co-Clustering (GPCC) Dhillon I S, Mallela S, Modha D S. Information-theoretic co-clustering[C]//Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003: 89-98. 2. Dhillon I S. Co-clustering documents and words using bipartite spectral graph partitioning[C]//Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001: 269-274. Inderjit S. Dhillon Center for Big Data Analytics Department of Computer Science University of Texas at Austin Information-Theoretic Co-Clustering (ITCC) • View (scaled) co-occurrence matrix as a joint probability distribution between row & column random variables Y X • p( x, y)  # co  occurence( x, y)  # co  occurence( x, y) x, y Yˆ X̂ We seek a hard-clustering of both dimensions such that loss in “Mutual Information” I ( X , Y ) - I ( Xˆ , Yˆ ) is minimized given a fixed no. of row & col. clusters Dhillon I S, Mallela S, Modha D S. Information-theoretic co-clustering[C]//Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003: 89-98. ITCC算法的核心 • 第一，选择矩阵重构后仍必须保证的原始数据矩阵中的统计量，即定义不变统计量 • 第二，选择恰当的距离度量方法，用来度量原始矩阵与协同聚类后压缩矩阵之间信息损失，即定义目标函数。 • ITCC选择矩阵行列的边缘分布作为不变统计量，相关熵（KLDivergence）作为聚类前后矩阵差异性的度量准则 ITCC算法相关分析 • NP难问题， ITCC算法得到的最终解是局部最优解 • 时间复杂度是O ( t* n*(k+l))，其中n 表示矩阵非零值的个数，k 表示行簇个数，l 表示列簇的个数，t 表示迭代的次数。 Graph-partitioning based Co-Clustering • Given disjoint document clusters D1, . . . ,Dk, the corresponding word clusters W1, . . . ,Wk may be determined as follows. • A given word wi belongs to the word cluster Wm if its association with the document cluster Dm is greater than its association with any other document cluster. • A natural measure of the association of a word with a document cluster is the sum of the edge-weights to all documents in the cluster. • Clearly the “best” word and document clustering would correspond to a partitioning of the graph such that the crossing edges between partitions have minimum weight. GPCC算法 • Adjacency Matrix  Laplacian matrix • This problem can be tackled with an SVD-related algorithm. • O(|E| + h logh) h=m+n (L=D-A) D: degree matrix Our Task Entity set Entity set l1 e1 e1 t1 l2 e2 e2 t2 . . . . . . . . . . . . lm eq eq tn Link set LE LET = types LE TE LT = Type set TE links 问题就可以描述为Link—Entity—Type协同聚类的问题：给定Link set L、Entity set E、Type set T、整数k 和关联强度函数 Relevance, 由协同聚类得到一个集合C 且Cx = (Lx, Tx) (1≤x≤k) ,使得 li Lx ,tjTy Relevance (li , tj)最小. (x, y= 1,...,k且x y) Definition li:  tx ; li关联到types tj:  ly; tj关联到links Relevance(li , tj)=  x=1,…,|li| Relevance(tx , tj) + ( 1- ) y=1,…,|tj| Relevance(li , ly) Relevance(tx , tj) =2 depth(LCS)/(depth(tx)+depth(tj)) Relevance(li , ly) =2 depth(LCS)/(depth(li)+depth(ly)) • Q&A

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Information-Theoretic Co