Co-clustering
Liang Zheng
Clustering
• Clustering along "one dimension": grouping together of "similar" objects
  – Hard clustering: each object belongs to a single cluster
  – Soft clustering: each object is probabilistically assigned to clusters
Co-clustering
• Given a multi-dimensional data matrix, co-clustering refers to simultaneous clustering along multiple dimensions.
[Figure: Doc-Term co-occurrence matrix – a sparse matrix of term-document co-occurrence counts]
Characteristics of Co-occurrence Matrices
• Data sparseness
• High dimensionality
• Noise
Related Methods
• Information-Theoretic Co-Clustering (ITCC) [1]
• Graph-partitioning based Co-Clustering (GPCC) [2]
[1] Dhillon I S, Mallela S, Modha D S. Information-theoretic co-clustering[C]//Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2003: 89-98.
[2] Dhillon I S. Co-clustering documents and words using bipartite spectral graph partitioning[C]//Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2001: 269-274.
Inderjit S. Dhillon
Center for Big Data Analytics
Department of Computer Science
University of Texas at Austin
Information-Theoretic Co-Clustering (ITCC)
• View the (scaled) co-occurrence matrix as a joint probability distribution between row and column random variables X and Y:
  p(x, y) = #co-occurrence(x, y) / Σ_{x, y} #co-occurrence(x, y)
• We seek a hard clustering of both dimensions into clustered variables X̂ and Ŷ such that the loss in mutual information
  I(X; Y) − I(X̂; Ŷ)
is minimized, given a fixed number of row and column clusters [1].
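To make the objective concrete, the following is a minimal numeric sketch (not the paper's code): `counts`, `row_labels`, and `col_labels` are hand-picked illustrations, whereas ITCC itself finds the assignments by alternating minimization.

```python
# A minimal numeric sketch of the ITCC objective (illustrative data).
import numpy as np

def mutual_information(p):
    """I(X;Y) for a joint distribution p over a 2-D grid."""
    px = p.sum(axis=1, keepdims=True)      # row marginals p(x)
    py = p.sum(axis=0, keepdims=True)      # column marginals p(y)
    nz = p > 0                             # skip empty cells: 0*log(0) = 0
    return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))

counts = np.array([[5.0, 5, 0, 0],
                   [4.0, 6, 0, 0],
                   [0.0, 0, 4, 4],
                   [0.0, 0, 6, 6]])
p_xy = counts / counts.sum()               # p(x, y) = counts / total

row_labels = np.array([0, 0, 1, 1])        # hard row clustering  X -> Xhat
col_labels = np.array([0, 0, 1, 1])        # hard column clustering Y -> Yhat
k, l = row_labels.max() + 1, col_labels.max() + 1

# Aggregate p(x, y) into the clustered joint p(xhat, yhat).
p_hat = np.zeros((k, l))
for i, r in enumerate(row_labels):
    for j, c in enumerate(col_labels):
        p_hat[r, c] += p_xy[i, j]

loss = mutual_information(p_xy) - mutual_information(p_hat)
print(f"I(X;Y) - I(Xhat;Yhat) = {loss:.4f}")  # non-negative; 0 iff lossless
```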
The Core of the ITCC Algorithm
• First, choose the statistics of the original data matrix that must still hold after the matrix is compressed, i.e., define the invariant statistics.
• Second, choose an appropriate distance measure to quantify the information loss between the original matrix and the compressed matrix produced by co-clustering, i.e., define the objective function.
• ITCC takes the row and column marginal distributions of the matrix as the invariant statistics, and relative entropy (KL divergence) as the measure of the difference between the matrices before and after clustering.
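The role of KL divergence can be sketched directly: the ITCC paper shows that the mutual-information loss equals the KL divergence between p(x, y) and a marginal-preserving approximation q(x, y) = p(x̂, ŷ) p(x|x̂) p(y|ŷ). The snippet below reuses `p_xy`, `p_hat`, `row_labels`, and `col_labels` from the previous sketch.

```python
# Sketch of the loss as a KL divergence (continues the previous snippet).
import numpy as np

def kl_divergence(p, q):
    nz = p > 0                               # 0 * log(0/q) = 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

px, py = p_xy.sum(axis=1), p_xy.sum(axis=0)          # row/col marginals
pxhat, pyhat = p_hat.sum(axis=1), p_hat.sum(axis=0)  # cluster marginals

# q(x, y) = p(xhat, yhat) * p(x | xhat) * p(y | yhat): it preserves the
# row/column marginals (ITCC's invariant statistics) and the cluster joint.
q = (p_hat[row_labels][:, col_labels]
     * np.outer(px / pxhat[row_labels], py / pyhat[col_labels]))

# Equals I(X;Y) - I(Xhat;Yhat) from the previous snippet.
print(f"KL(p || q) = {kl_divergence(p_xy, q):.4f}")
```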
Analysis of the ITCC Algorithm
• The problem is NP-hard; the solution ITCC converges to is a local optimum.
• The time complexity is O(t · n · (k + l)), where n is the number of nonzero entries in the matrix, k is the number of row clusters, l is the number of column clusters, and t is the number of iterations.
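To see where t, n, k, and l enter that cost, here is a sketch of a single row-update sweep in the style of the paper's update rule (the column sweep is symmetric, and t such sweeps give the quoted complexity); it again reuses the variables from the earlier sketches.

```python
# One ITCC row-update sweep: each row x moves to the row cluster r that
# minimizes KL(p(Y|x) || q(Y|r)). Reuses p_xy, p_hat, row_labels,
# col_labels from the earlier snippets.
import numpy as np

def kl_divergence(p, q):
    nz = p > 0
    with np.errstate(divide="ignore"):     # q may be 0 there -> KL is inf
        return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

def row_sweep(p_xy, p_hat, row_labels, col_labels):
    k = p_hat.shape[0]
    py = p_xy.sum(axis=0)
    pxhat, pyhat = p_hat.sum(axis=1), p_hat.sum(axis=0)
    # q(y | xhat = r) = p(yhat(y) | r) * p(y | yhat(y)), one row per cluster
    q = (p_hat[:, col_labels] / pxhat[:, None]) * (py / pyhat[col_labels])
    for i in range(p_xy.shape[0]):         # touches each nonzero k times
        cond = p_xy[i] / p_xy[i].sum()     # p(Y | x = i)
        row_labels[i] = min(range(k),
                            key=lambda r: kl_divergence(cond, q[r]))
    return row_labels
```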
Graph-partitioning based Co-Clustering
• Given disjoint document clusters D1, ..., Dk, the corresponding word clusters W1, ..., Wk may be determined as follows.
• A given word wi belongs to the word cluster Wm if its association with the document cluster Dm is greater than its association with any other document cluster.
• A natural measure of the association of a word with a document cluster is the sum of the edge weights to all documents in the cluster (see the sketch after this list).
• Clearly the "best" word and document clustering would correspond to a partitioning of the graph such that the crossing edges between partitions have minimum weight.
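A minimal sketch of this assignment rule, assuming an illustrative edge-weight matrix `A` (documents × words) and given document labels (neither is from the paper's code):

```python
# Sketch of the word-assignment rule; A and doc_labels are illustrative.
import numpy as np

A = np.array([[2.0, 1, 0, 0],     # edge weights between 3 docs (rows)
              [1.0, 2, 0, 1],     # and 4 words (columns)
              [0.0, 0, 3, 2]])
doc_labels = np.array([0, 0, 1])  # disjoint document clusters D1, D2

k = doc_labels.max() + 1
# association[m, i] = total edge weight from word wi to docs in cluster Dm
association = np.vstack([A[doc_labels == m].sum(axis=0) for m in range(k)])
word_labels = association.argmax(axis=0)  # wi joins the cluster Wm it is
print(word_labels)                        # most associated with: [0 0 1 1]
```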
The GPCC Algorithm
• Adjacency matrix → Laplacian matrix L = D − A, where D is the degree matrix.
• The minimum-cut problem can be tackled with an SVD-related spectral algorithm (sketched below).
• Time complexity: O(|E| + h log h), where h = m + n.
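A rough sketch of that spectral approach in the spirit of Dhillon (2001): normalize the doc-term matrix by row and column degrees, take singular vectors, and run k-means on the combined embedding. The matrix `A` and several details (e.g., the exact handling of the ⌈log2 k⌉ truncation) are illustrative simplifications, not the paper's implementation.

```python
# Sketch of bipartite spectral co-clustering; A is illustrative.
import numpy as np
from sklearn.cluster import KMeans

def spectral_cocluster(A, k):
    d1 = A.sum(axis=1)                     # document degrees
    d2 = A.sum(axis=0)                     # word degrees
    An = A / np.sqrt(np.outer(d1, d2))     # D1^{-1/2} A D2^{-1/2}
    U, s, Vt = np.linalg.svd(An)
    l = int(np.ceil(np.log2(k)))           # number of singular vectors kept
    # Skip the trivial first singular pair; undo the degree scaling.
    Z = np.vstack([U[:, 1:l + 1] / np.sqrt(d1)[:, None],
                   Vt[1:l + 1].T / np.sqrt(d2)[:, None]])
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(Z)
    return labels[:A.shape[0]], labels[A.shape[0]:]   # docs, words

A = np.array([[2.0, 1, 0, 0],
              [1.0, 2, 0, 0],
              [0.0, 0, 3, 2],
              [0.0, 0, 2, 3]])
doc_clusters, word_clusters = spectral_cocluster(A, k=2)
print(doc_clusters, word_clusters)         # e.g. [0 0 1 1] [0 0 1 1]
```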
Our Task
[Figure: Link–Entity–Type structure – a Link set {l1, ..., lm} and a Type set {t1, ..., tn} are connected through a shared Entity set {e1, ..., eq}; LE and TE denote the link–entity and type–entity association matrices relating links to types]
The problem can then be described as a Link–Entity–Type co-clustering problem: given a Link set L, an Entity set E, a Type set T, an integer k, and an association-strength function Relevance, co-clustering yields a collection C with Cx = (Lx, Tx) (1 ≤ x ≤ k) such that
  Σ_{li ∈ Lx, tj ∈ Ty} Relevance(li, tj)
is minimized for x, y = 1, ..., k with x ≠ y.
Definition
• li → {tx}: the set of types that link li is associated with
• tj → {ly}: the set of links that type tj is associated with
• Relevance(li, tj) = α · Σ_{x=1,...,|li|} Relevance(tx, tj) + (1 − α) · Σ_{y=1,...,|tj|} Relevance(li, ly), where α ∈ [0, 1] is a mixing weight
• Relevance(tx, tj) = 2 · depth(LCS) / (depth(tx) + depth(tj)), where LCS is the least common subsumer of tx and tj in the type hierarchy
• Relevance(li, ly) = 2 · depth(LCS) / (depth(li) + depth(ly))
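An illustrative sketch of these definitions, assuming the types form a tree encoded by a hypothetical `parent` map; the mixing weight `alpha` and the precomputed link-side terms are assumptions, since the corresponding symbols are only partially legible in the source.

```python
# Sketch of the Relevance definitions; `parent`, `alpha`, and the
# precomputed link-side relevances are illustrative assumptions.
parent = {"thing": None, "person": "thing", "artist": "person",
          "actor": "artist", "musician": "artist", "place": "thing"}

def depth(t):
    """Depth of node t in the hierarchy (root has depth 1)."""
    return 1 if parent[t] is None else 1 + depth(parent[t])

def ancestors(t):
    chain = []
    while t is not None:
        chain.append(t)
        t = parent[t]
    return chain

def lcs(a, b):
    """Least common subsumer: the deepest ancestor shared by a and b."""
    shared = set(ancestors(a))
    return next(t for t in ancestors(b) if t in shared)

def rel_type_type(tx, tj):
    # Relevance(tx, tj) = 2 * depth(LCS) / (depth(tx) + depth(tj))
    return 2 * depth(lcs(tx, tj)) / (depth(tx) + depth(tj))

def relevance(li_types, tj, link_side_terms, alpha=0.5):
    """Relevance(li, tj): type-side sum mixed with link-side sum.

    li_types: the types {tx} associated with link li.
    link_side_terms: precomputed Relevance(li, ly) values for the links
    {ly} associated with tj (defined analogously over a link hierarchy,
    omitted here).
    """
    type_part = sum(rel_type_type(tx, tj) for tx in li_types)
    return alpha * type_part + (1 - alpha) * sum(link_side_terms)

print(rel_type_type("actor", "musician"))   # 2*3 / (4+4) = 0.75
```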
• Q&A