Mining Graph Data
Marina Meila
University of Washington, Department of Statistics
www.stat.washington.edu

Graph data: an example
• Dept. of Statistics technical reports
• edge weight Sij = number of reports co-authored by i, j

Examples of graph data
• Social networks: friendships, work relationships; AIDS epidemiology; transactions between economic agents; internet communities (e.g. usenet, chat rooms)
• Document databases, the web: citations, hyperlinks (not symmetric)
• Computer networks
• Image segmentation: data points are pixels; features are distance, contour, color, texture; natural images, medical images, satellite images, etc.
• Protein-protein interactions, similarities
• Linguistics
• Vector data can be transformed into pairwise data: by nearest-neighbor graphs, or by "kernelization" (as in SVMs)

Graph data and the similarity matrix
Graph data can be
• Symmetric similarities between nodes, Sij = Sji ≥ 0 (e.g. number of papers co-authored) — the case for most of this talk
• Asymmetric affinities Aij ≥ 0 (e.g. number of links from site i to site j)
• Node attributes (e.g. age, university)
• [Symmetric dis-similarities]

Overview
• Graph data
• The problem: what does it mean to do classification or clustering on a graph? three approaches to grouping
• Clustering
• Semisupervised learning
• Kernels on graphs
• Other and future directions

The main difference
• In standard tasks, the data are independent vectors, x = (age, number of publications, ...); the training set {x1, x2, ..., xn} is a set of persons sampled independently from the population.
• In graph mining tasks, the data are the (weighted) links between graph nodes: S(x, x') = number of papers co-authored by x and x'; the "training set" is the whole co-authorship network.

The problem
Standard data mining tasks
• data = independent vectors (x1, ..., xn) in R^d
• [labels (y1, ..., yn) in {-1, 1}]
• Classification — supervised learning
• Semisupervised learning
• Clustering — unsupervised learning
Graph mining tasks
• data = graph on n nodes, node similarities Sij
• [labels (y1, ..., yn) in {-1, 1}]
• Classification — supervised learning
• Semisupervised learning
• Clustering and embedding — unsupervised learning

Clustering (example figures: 3 clusters, 2 clusters)
Embedding (example figure)
Semisupervised learning = transductive classification (example figure)

Three paradigms for grouping nodes in graphs
Both clustering and classification can be seen as grouping.
• Graph cuts: remove some edges to disconnect the graph; the groups are the connected components
• By "similar behavior": nodes i, j are in the same group iff i, j "have the same pattern of connections" w.r.t. the other nodes
• By embedding: map the nodes {1, 2, ..., n} to points {x1, x2, ..., xn} in R^d, then use standard classification and clustering methods

1. Graph cuts
Definitions
• node degree (or volume): Di = sum_j Sij
• volume of cluster C: Vol(C) = sum_{i in C} Di
• cut between clusters C, C': Cut(C, C') = sum_{i in C, j in C'} Sij
MinCut vs Multiway Normalized Cut (MNCut)
• MinCut: minimize Cut(C, C') over all partitions C, C'; polynomial, BUT the resulting partition can be imbalanced
• MNCut(C) = sum_{k=1..K} Cut(Ck, V \ Ck) / Vol(Ck); for K = 2 this is Cut(C, C')/Vol(C) + Cut(C, C')/Vol(C')
Motivation for MNCut: MNCut is smallest for the "best" clustering in many situations (e.g. with Sij ∝ 1/dist(i, j)).

2. "Patterns of behavior": the random walks view
• volume (degree) of node i: Di = sum_j Sij
• transition probability: Pij = Sij / Di
• matrix notation: D = diag(D1, D2, ..., Dn), P = D^{-1} S
Idea: nodes i, j are grouped together iff they transition in the same way to the other clusters.
(Example figure: every node in one cluster has (Pi,red, Pi,yellow) = (1/5, 4/5), while every node in the other cluster has (2/3, 1/3).)
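To make these definitions concrete, here is a minimal numpy sketch (not from the slides): it builds the transition matrix P = D^{-1} S and evaluates MNCut for a given clustering. The function names and the toy similarity matrix are illustrative choices, not part of the original talk.

```python
import numpy as np

def transition_matrix(S):
    """Row-normalize a symmetric similarity matrix: P = D^{-1} S."""
    d = S.sum(axis=1)              # node degrees (volumes) D_i
    return S / d[:, None]          # P_ij = S_ij / D_i

def mncut(S, labels, K):
    """Multiway normalized cut: sum_k Cut(C_k, V \\ C_k) / Vol(C_k)."""
    d = S.sum(axis=1)
    total = 0.0
    for k in range(K):
        in_k = (labels == k)
        vol = d[in_k].sum()                    # Vol(C_k)
        cut = S[np.ix_(in_k, ~in_k)].sum()     # Cut(C_k, complement)
        total += cut / vol
    return total

# Toy example: two tightly connected triangles joined by two weak edges.
S = np.array([[0, 5, 5, 1, 0, 0],
              [5, 0, 5, 0, 0, 0],
              [5, 5, 0, 0, 0, 1],
              [1, 0, 0, 0, 5, 5],
              [0, 0, 0, 5, 0, 5],
              [0, 1, 0, 5, 5, 0]], dtype=float)
P = transition_matrix(S)
print(mncut(S, np.array([0, 0, 0, 1, 1, 1]), K=2))  # small value for the "right" split
```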
3. Embedding
• mapping from the nodes to R^d: node i is mapped to xi = (fi^(1), fi^(2), ..., fi^(d)), where fi^(k) is the k-th coordinate of node i and each f^(k) is a vector with n elements
• wanted: nodes that are similar are mapped near each other
• ideally: all nodes in a group map to the same point

Another look at Pi,C
• (Same example figure: (Pi,red, Pi,yellow) equals (1/5, 4/5) on one cluster and (2/3, 1/3) on the other.)
• Pi,red, viewed as a function of i, is a piecewise constant function fred (value 1/5 on one group, 2/3 on the other)
• not all graphs produce perfect embeddings by this method
• one would need to know the groups to obtain the embedding

Three approaches to grouping, summarized
1. Minimize MNCut (how?)
2. Random walks: group by similarity of the "aggregated" transitions Pi,C (how?)
3. Embedding (how?)
Will show that (1) approaches 1-2-3 are equivalent, and (2) a spectral algorithm solves the problem.

Overview (section: Clustering — random walks: a spectral clustering algorithm; spectral clustering as optimization; a stability result)

Theorem 1 (Lumpability). Let S be an n x n similarity matrix and C = {C1, C2, ..., CK} a clustering. Then the transition probabilities Pi,C are piecewise constant iff the transition matrix P = D^{-1} S has K piecewise constant eigenvectors.
Why is this important?
• it suggests an algorithm to find the grouping C — a spectral algorithm
• grouping by the similarity of connections is a form of embedding

A spectral clustering algorithm
Algorithm SC (Meila & Shi, 01) (there are many other variants)
INPUT: number of clusters K, symmetric similarity matrix S
1. Compute the transition matrix P
2. Compute the K largest eigenvalues of P and their eigenvectors: λ1 ≥ λ2 ≥ ... ≥ λK, v^1, v^2, ..., v^K
3. Spectral mapping: map node i to xi = (v^1_i, v^2_i, ..., v^K_i) in R^K
4. Cluster the data in R^K by e.g. min diameter or k-means (Dasgupta & Schulman 02)
OUTPUT: clustering C

Spectral clustering in a nutshell
weighted graph (n vertices to cluster; the observations are pairwise similarities) → similarity matrix S (n x n, symmetric, Sij ≥ 0) → transition matrix P (normalize the rows) → first K eigenvectors of P (spectral mapping) → K clusters (clustering in R^K)

Theorem 2 (Multicut). Let S be an n x n similarity matrix, L = I - D^{-1/2} S D^{-1/2}, P = D^{-1} S, C = {C1, C2, ..., CK} a clustering and Y its cluster indicator matrix. Then MNCut(C) can be written as a quadratic form in Y, and
MNCut(C) ≥ K - (λ1 + λ2 + ... + λK),
with equality iff the eigenvectors v^1, v^2, ..., v^K of P are piecewise constant.
Why is this important?
• MNCut has a quadratic expression (used later)
• it gives a non-trivial lower bound for MNCut (used later)
• for (nearly) perfect P, the Spectral Clustering Algorithm minimizes MNCut
Hence the SC algorithm can be viewed in three different ways.
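The following is a compact sketch of Algorithm SC as described above, assuming numpy and scikit-learn's KMeans for step 4; it is an illustration of the procedure on the slide, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering_sc(S, K):
    """Sketch of Algorithm SC: transition matrix, top-K eigenvectors,
    spectral mapping to R^K, then clustering in R^K."""
    d = S.sum(axis=1)
    P = S / d[:, None]                     # step 1: P = D^{-1} S
    vals, vecs = np.linalg.eig(P)          # eigenvalues are real when S is symmetric
    top = np.argsort(-vals.real)[:K]       # step 2: K largest eigenvalues
    X = vecs[:, top].real                  # step 3: node i -> x_i = (v1_i, ..., vK_i)
    return KMeans(n_clusters=K, n_init=10).fit_predict(X)   # step 4: cluster in R^K
```

On a similarity matrix like the toy example in the earlier sketch, this should recover the two tightly connected groups.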
Theorem 3 (Stability). The eigengap of P measures the stability of the K-th principal subspace w.r.t. perturbations of P. Informally: any two clusterings whose MNCut is close to the lower bound of Theorem 2 must be close to each other (the formal definition, theorem, gap condition, and corollary are given as formulas on the original slides).
Significance: if a stability theorem holds, any two "good" clusterings are close; in particular, no "good" clustering can be too far from the optimal C*. Is the bound ever informative?

An experiment: S perfect + additive noise (results shown as figures in the original)

Overview (section: Semisupervised learning)

Semisupervised grouping
• Data: (i1, y1), (i2, y2), ..., (il, yl) = l labeled nodes; i_{l+1}, ..., i_{l+u} = u unlabeled nodes; l + u = n
• It is assumed that the groups (classes) agree with the graph structure
• (figures: ignoring the unlabeled data vs. using the unlabeled data)

MNCut as smoothness
• Let f in R^n be the labeling, fi = class(node i)
• In R^d, smoothness is measured by the functional ∫ ||grad f||^2 dP
• On a graph, grad f becomes the differences fi - fj and P becomes a discrete measure, giving the smoothness penalty sum_{ij} Sij (fi - fj)^2

The Laplace operator(s) on a graph
• Unnormalized Laplacian: L = D - S (intuitive)
• Normalized Laplacian: L = I - D^{-1/2} S D^{-1/2} (scale invariant; a compact operator, with better convergence properties)

Graph regularized Least Squares (Belkin & Niyogi '05)
• For simplicity assume K = 2 classes, with labels yi in {-1, 1}
• Criterion: minimize smoothness + labeling error, f^T L f + λ sum_{i labeled} (fi - yi)^2, where λ is the regularization parameter (to be chosen)
• Solution: the criterion is quadratic, so its gradient is linear and the minimizer f* is obtained by solving a linear system; label node i by y(i) = sign(fi)
• The approach extends to K > 2 classes

Overview (section: Kernels on graphs — graph regularized SVM, heat kernels)

Kernel machines
• Kernel machines / support vector machines solve problems of the form min over f of (labeling cost + ||f||^2) in an elegant way
• when the cost and ||f|| can be expressed in terms of a scalar product between data points
• the scalar product <x, x'> = K(x, x') defines the kernel K
• Our problem: define a kernel between the nodes of a graph; it has to reflect the graph topology

Kernels on graphs
1. "Manifold regularization": a kernel K is given (e.g. the data are vectors in R^N); a graph + S are given (e.g. a nearest-neighbors graph); the task is classification; the graph adds regularization (= a smoothness penalty) based on the unlabeled data
2. "Heat kernel": a graph + S are given; the task is to find a kernel on the finite set of graph nodes [it will then be used to label the nodes as in a regular SVM]

Graph regularized SVM (Belkin & Niyogi '05)
• graph given, e.g. a nearest-neighbor graph; kernel K given
• problem formulation: minimize the labeling cost plus both the ambient norm ||f||_K and a graph smoothness penalty ||f||_I
• Representer theorem: the minimizer can be written as an expansion f*(x) = sum_{i=1}^{l+u} αi K(xi, x), provided || ||_I is smooth enough w.r.t. || ||_K

The Heat kernel (Kondor & Lafferty 03)
• The heat (diffusion) equation: df/dt = -Δf, where f(x, t) = "temperature" and Δ = the Laplace operator; the solution is f(·, t) = Kt f(·, 0), with Kt = the heat kernel
• On a graph, the heat kernel has a discrete-time version and a continuous-time version, Kt = exp(-t L); t is the smoothing parameter of the kernel
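As an illustration (a minimal sketch, not from the slides), the continuous-time heat kernel can be computed with scipy's matrix exponential. The normalized Laplacian is used here; the unnormalized L = D - S could be substituted the same way.

```python
import numpy as np
from scipy.linalg import expm

def heat_kernel(S, t):
    """Continuous-time graph heat kernel K_t = exp(-t L),
    where L = I - D^{-1/2} S D^{-1/2} is the normalized Laplacian."""
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(S)) - (d_inv_sqrt[:, None] * S) * d_inv_sqrt[None, :]
    return expm(-t * L)   # larger t means more smoothing (diffusion) over the graph
```

K_t is symmetric positive semidefinite for every t ≥ 0, so it can be plugged into any kernel machine over the graph nodes.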
Generalized Heat Kernel (Smola & Kondor 03)
Theorem: the only linear, permutation-invariant mappings S → T(S) in R^{n x n} are linear combinations of S, the degree matrix D, and the identity.
Idea:
1. choose a regularization norm ||f||^2 = <f, Qf>
2. with Q = q(L), a function of the Laplacian
3. define the inner product <f, f'>_Q = <f, Qf'>
Theorem: <f, Qf'> defines a reproducing kernel Hilbert space (RKHS); the kernel is K = Q^{-1} (the pseudo-inverse when Q is singular).

Overview (section: Other and future directions)

Other aspects and future directions
• Computation
• Selecting the number of clusters K
• Obtaining / learning the similarities Sij
• Other tasks: ranking, influence, communication
• Incorporating constraints (prior knowledge), statistical models, vector data
• Directed graphs / asymmetric S matrix

Computation
• The algorithms are polynomial but intensive: all eigenvectors cost about n^3; K eigenvectors cost about nK per iteration, times the number of iterations; the SVM solver is a quadratic optimization problem
• Numerical stability is a concern
• Good news: many graphs are sparse, which saves memory and computation

Perfect (P, C) pair
• (Diagram: clusters C1 and C2, with node-level transitions inside C1 and aggregated transition probabilities R12, R21 between the clusters.)
• The "chain" over clusters is generally not Markov, i.e., knowing past states gives information about the future.
• Definition: (P, C) is a perfect pair iff the aggregated chain is Markov.

The spectral mapping
• If (P, C) is perfect, then v1, v2, ..., vK, the first K eigenvectors of P, are piecewise constant (PC) over the clusters.
• The spectral mapping plots the data as elements of (v2, v3) (figure).

The "classification error" distance
• computed by the maximal bipartite matching algorithm between the clusters of the two clusterings
• build the classification confusion matrix with entries Dkk' (clusters k and k'), match the clusters so as to maximize agreement, and take the unmatched mass as the error
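A small sketch (my own illustration, assuming scipy) of the classification-error distance just described: the confusion matrix is built from two label vectors, and the best cluster-to-cluster matching is found with the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def classification_error(labels_a, labels_b, K):
    """Classification-error distance between two clusterings of the same n nodes:
    1 minus the largest fraction of nodes that can agree under some matching of
    the clusters of one clustering to the clusters of the other."""
    confusion = np.zeros((K, K))
    for a, b in zip(labels_a, labels_b):
        confusion[a, b] += 1                         # D_kk' = # nodes in cluster k of A and k' of B
    rows, cols = linear_sum_assignment(-confusion)   # maximal bipartite matching (maximize agreement)
    return 1.0 - confusion[rows, cols].sum() / len(labels_a)
```

For example, classification_error([0, 0, 1, 1], [1, 1, 0, 0], K=2) returns 0.0, since the two clusterings are identical up to a relabeling of the clusters.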