Mining Graph Data
Marina Meila
University of Washington
Department of Statistics
www.stat.washington.edu
Graph Data—An example

edge weight Sij = number of reports co-authored by i, j
[Figure: co-authorship graph of the Dept. of Statistics technical reports]
Examples of graph data
Social networks
• friendships, work relationships
• AIDS epidemiology
• transactions between economic agents
• internet communities (e.g. Usenet, chat rooms)
Document databases, the web
• Citations, hyperlinks (not symmetric)
Computer networks
Image segmentation
• Data points are pixels
• features are distance, contour, color, texture
• Natural images, medical images, satellite images, etc.
Protein-protein interactions, similarities
Linguistics
Vector data can be transformed into pairwise data
 by nearest neighbor graphs
 by “kernelization” (as in SVMs); a sketch of both constructions follows below
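For illustration, a minimal sketch of both constructions (assuming numpy and scikit-learn are available; X is made-up vector data, one row per node):

import numpy as np
from sklearn.neighbors import kneighbors_graph
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.rand(100, 5)                  # hypothetical vector data, one row per node

# nearest neighbor graph, symmetrized so that Sij = Sji
A = kneighbors_graph(X, n_neighbors=10, mode='connectivity').toarray()
S_knn = np.maximum(A, A.T)

# "kernelization" with a Gaussian (RBF) kernel: Sij = exp(-gamma * ||xi - xj||^2)
S_rbf = rbf_kernel(X, gamma=1.0)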
Graph data and the similarity matrix

Graph data can be
 Symmetric similarities between nodes Sij = Sji ≥ 0 (used for most of this talk)
• e.g. number of papers co-authored
 Asymmetric affinities Aij ≥ 0
• e.g. number of links from site i to site j
 Node attributes
• e.g. age, university
 [Symmetric dis-similarities]
Overview

Graph data


The problem
 what does it mean to do classification or clustering on a
graph?
 three approaches to grouping
Clustering

Semisupervised learning

Kernels on graphs

Other and future directions
The main difference

In standard tasks, data are independent vectors
 x = (age, number publications, ...)
 Training set { x1, x2, ... xn} = a set of persons sampled
independently from the population

In graph mining tasks, the data are the (weighted) links
between graph nodes
 S(x,x’) = number of papers co-authored by x, x’
 “Training set” = the whole co-authorship network
The problem
Standard data mining tasks
 data = independent vectors (x1, ..., xn) in R^d
 [labels (y1,...,yn) in {-1,1}]
 Classification
 supervised learning

Semisupervised learning
 Clustering
 unsupervised learning
Graph mining tasks
 data = graph on n nodes
 node similarities Sij
 [labels (y1,...,yn) in {-1,1}]
 Classification
 supervised learning

Semisupervised learning

Clustering and embedding
 unsupervised learning

[Figures: Clustering (the example graph partitioned into 3 clusters and into 2 clusters); Embedding; Semisupervised learning (transductive classification)]
Three paradigms for grouping nodes in graphs

Both clustering and classification can be seen as grouping

Graph cuts
 remove some edges → disconnected graph
 the groups are the connected components

By “similar behavior”
 nodes i, j in the same group iff i,j “have the same pattern
of connections” w.r.t other nodes

By Embedding
 map nodes {1,2,...,n} ---> {x1, x2, ..., xn} in R^d
 then use standard classification and clustering methods
1. Graph cuts

Definitions
 node degree (or volume) D_i
 volume of cluster C
 cut between clusters C,C’
MinCut vs Multiway Normalized Cut (MNCut)

MinCut minimize Cut( C, C’) over all partitions C,C’
 polynomial
 BUT: resulting partition can be imbalanced

MNCut

For K=2
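The definition formulas on this slide were lost in the transcript; in the usual normalized-cut notation for the quantities named above they read:

$$D_i = \sum_j S_{ij}, \qquad \mathrm{Vol}(C) = \sum_{i \in C} D_i, \qquad \mathrm{Cut}(C, C') = \sum_{i \in C} \sum_{j \in C'} S_{ij}$$

$$\mathrm{MNCut}(C_1,\dots,C_K) = \sum_{k=1}^{K} \frac{\mathrm{Cut}(C_k, V \setminus C_k)}{\mathrm{Vol}(C_k)}, \qquad \text{for } K=2:\ \frac{\mathrm{Cut}(C, C')}{\mathrm{Vol}(C)} + \frac{\mathrm{Cut}(C, C')}{\mathrm{Vol}(C')}$$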
Motivation for MNCut

MNCut is smallest for the “best” clustering in many situations
Sij ∝ 1/dist(i, j)
2. “Patterns of behavior”: The random walks view
 volume (degree) of node i: Di = Σj Sij
 transition probability: Pij = Sij / Di
 matrix notation: D = diag( D1, D2, … Dn ), P = D^{-1} S
[Figure: a node i with similarity-weighted edges Sij, Sik, Sil and the corresponding random-walk transition probabilities Pij, Pik, Pil]
Idea:
 nodes i, j are grouped together iff they transition in the same way to the other clusters
[Figure: example graph with two clusters (red and yellow); every node i in one cluster has aggregated transitions (Pi,red, Pi,yellow) = (1/5, 4/5) and every node in the other has (2/3, 1/3)]
3. Embedding

 mapping from nodes to R: a function f, i.e. a vector with n elements
 mapping from nodes to R^d: [f(1) f(2) ... f(d)]
 where fi(k) represents the k-th coordinate of node i

 wanted
 nodes that are similar mapped near each other
 ideally: all nodes in a group map to the same point
Another look at Pi,C
[Figure: the same two-cluster example; for each node i the aggregated transitions (Pi,red, Pi,yellow) take only two values, (1/5, 4/5) on one cluster and (2/3, 1/3) on the other]
 Pi,red, viewed as a function of the node i, is a piecewise constant function fred (with values 1/5 and 2/3); likewise Pi,yellow (with values 4/5 and 1/3)
 not all graphs produce perfect embeddings by this method
 need to know the groups to obtain the embedding
Three approaches to grouping, summarized
1. Minimize MNCut (how?)
2. Random walks (how?)
 group by similarity of “aggregated” transitions Pi,C
3. Embedding (how?)
Will show that
1. approaches 1-2-3 are equivalent
2. a spectral algorithm solves the problem
Overview

Graph data

The problem

Clustering
 Random walks: a spectral clustering algorithm
 spectral clustering as optimization
 a stability result

Semisupervised learning

Kernels on graphs

Other and future directions
Theorem 1. Lumpability



Let S = n x n similarity matrix
C = {C1, C2, ... CK} a clustering
Then
 the transition probabilities Pi,C are piecewise constant
 iff the transition matrix P = D^{-1} S has K piecewise constant
eigenvectors
Why is this important?
 suggests an algorithm to find the grouping C
• spectral algorithm
 grouping by the similarity of connections is a form of
embedding
A spectral clustering algorithm
Algorithm SC
(Meila & Shi, 01)
(there are many
other variants)
INPUT: number of clusters K
symmetric similarity matrix S
1. Compute transition matrix P
2. Compute K largest eigenvalues of P and their eigenvectors
λ1 ≥ λ2 ≥ … ≥ λK , v1, v2, …, vK
3. Spectral mapping: map nodes to R^K by
node i → xi = ( v1i v2i … vKi )
4. Cluster data in R^K by e.g. min diameter, k-means
OUTPUT : clustering C
(Dasgupta & Schulman 02)
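Since the steps above are fully specified, a minimal Python sketch of Algorithm SC (assuming numpy, scipy and scikit-learn are available; k-means stands in for the generic "cluster data in R^K" step, and the function name is mine):

import numpy as np
from scipy.linalg import eig
from sklearn.cluster import KMeans

def spectral_clustering_sc(S, K):
    """Sketch of Algorithm SC: S is a symmetric n x n similarity matrix, K the number of clusters."""
    D = S.sum(axis=1)                        # node degrees (volumes) Di
    P = S / D[:, None]                       # transition matrix P = D^{-1} S (row-normalized S)
    evals, evecs = eig(P)                    # P is not symmetric, so use a general eigensolver
    top = np.argsort(-evals.real)[:K]        # indices of the K largest eigenvalues
    X = evecs[:, top].real                   # spectral mapping: node i -> xi = (v1i, ..., vKi)
    # (in practice one often works with the symmetric matrix D^{-1/2} S D^{-1/2} for stability)
    return KMeans(n_clusters=K, n_init=10).fit_predict(X)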
Spectral clustering in a nutshell
weighted graph (n vertices to cluster; observations are pairwise similarities)
→ similarity matrix S (n x n, symmetric, Sij ≥ 0)
→ [normalize rows] transition matrix P
→ [spectral mapping] first K eigenvectors of P
→ [clustering in R^K] K clusters
Theorem 2. Multicut

 Let S = an n x n similarity matrix,
 L = I − D^{-1/2} S D^{-1/2} and P = D^{-1} S,
 C = {C1, C2, ... CK} a clustering, Y its indicator matrix
 Then
 MNCut(C) = K − Σ_{k=1}^K (Yk' S Yk)/(Yk' D Yk)  (a quadratic expression in Y)
 MNCut(C) ≥ K − Σ_{k=1}^K λk
 with equality iff the eigenvectors (v1 v2 .. vK) of P are piecewise constant


Why is this important?
 MNCut has a quadratic expression (used later)
 non-trivial lower bound for MNCut (used later)
 for (nearly) perfect P the Spectral Clustering Algorithm minimizes MNCut
Hence the SC algorithm can be viewed in three different ways
Theorem 3. Stability

The eigengap of P

 measures the stability of the K-th principal subspace w.r.t
perturbations of P
Definition

Theorem
Let
be two clusterings with
Then,
Significance

If a stability theorem holds
 any two “good” clusterings are close
 in particular, no “good” clustering can be too far from the
optimal C*

Gap Corollary If
then
Is the bound ever informative?

An experiment: S perfect + additive noise
Overview

Graph data

The problem

Clustering

Semisupervised learning

Kernels on graphs

Other and future directions
Semisupervised grouping

Data
 (i1,y1),(i2,y2)...(il,yl) = l labeled
nodes
 il+1,...il+u = u unlabeled nodes
 l+u = n

Assumed that groups (classes) agree with graph structure
[Figure: example classification ignoring the unlabeled data vs. using the unlabeled data]
MNCut as smoothness



Let f in R^n = the labeling
 fi = class( node i )
In R^d
 smoothness functional
On graph
 grad f ↔ fi – fj
 P = discrete measure
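The two smoothness functionals referred to above (their formulas were lost in the transcript) take the standard form of the Dirichlet energy and its graph analogue, a quadratic form in the Laplacian defined on the next slide:

$$\text{in } \mathbb{R}^d:\ \int \|\nabla f\|^2 \, d\rho, \qquad \text{on a graph:}\ \sum_{i,j} S_{ij} (f_i - f_j)^2 = 2\, f^\top (D - S)\, f$$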
The Laplace operator(s) on a graph

Unnormalized Laplacian
L = D – S

 intuitive
Normalized Laplacian
L = I – D^{-1/2} S D^{-1/2}
 scale invariant
 compact operator
• better convergence properties
Graph regularized Least Squares


Belkin & Niyogi ‘05
For simplicity assume K = 2
Criterion: Minimize smoothness + labeling error
 regularization parameter (to be chosen)

Solution
 Quadratic criterion → linear gradient
 solution f* obtained by solving a linear system
 label node i by y(i) = sign( fi )

Approach extends to K>2 classes
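A minimal numpy sketch of this procedure (the names lam, labeled, and the function itself are mine; y is a length-n vector of ±1 labels whose entries at unlabeled nodes are ignored):

import numpy as np

def graph_regularized_ls(S, y, labeled, lam=1.0):
    """Minimize sum over labeled i of (f_i - y_i)^2 + lam * f' L f, then label nodes by sign(f)."""
    D = S.sum(axis=1)
    Dm12 = 1.0 / np.sqrt(D)
    L = np.eye(len(D)) - Dm12[:, None] * S * Dm12[None, :]   # normalized Laplacian I - D^{-1/2} S D^{-1/2}
    J = np.diag(labeled.astype(float))        # J selects the labeled nodes in the loss term
    f = np.linalg.solve(J + lam * L, J @ y)   # the quadratic criterion gives a linear system
    return np.sign(f)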
Overview

Graph data

The problem

Clustering

Semisupervised learning

Kernels on graphs
 graph regularized SVM
 heat kernels

Other and future directions
Kernel machines

Kernel machines / Support vector machines solve the problem
 min
 in an elegant way
• when cost and ||f|| can be expressed in terms of a
scalar product between data points
 the scalar product <x,x’> = K(x,x’)
• defines the kernel K

Our problem: define a kernel between nodes of a graph
 has to reflect the graph topology
Kernels on graphs
1. “Manifold regularization”
kernel K is given
• e.g. data are vectors in R^N
 graph + S given
• e.g. nearest neighbors graph
 task = classification
 adds regularization (=smoothness penalty) based on
unlabeled data
2. “Heat kernel”
 graph + S given
 task
• find a kernel on the finite set of graph nodes
• [will use it to label the nodes as in a regular SVM]

Graph regularized SVM


Graph given
 e.g. nearest neighbor graph
Kernel K given

Problem formulation

Representer theorem
 if || ||I is smooth enough w.r.t. || ||K
Belkin & Niyogi ‘05
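For reference, the problem formulation and representer theorem (formulas lost in the transcript) take, in Belkin & Niyogi's manifold-regularization framework, the following standard form, where γ_A, γ_I are the two regularization weights and V is the loss (hinge loss in the SVM case):

$$f^* = \arg\min_{f \in \mathcal{H}_K}\ \frac{1}{l}\sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \|f\|_K^2 + \gamma_I\, \mathbf{f}^\top L\, \mathbf{f}, \qquad f^*(x) = \sum_{i=1}^{l+u} \alpha_i K(x_i, x)$$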
The Heat kernel

Kondor & Lafferty 03
The [heat] diffusion equation
 f(x,t) = “temperature”
 Δ = Laplace operator
 solution
• with Kt = the heat kernel

On graph
 heat kernel (discrete time)
 continuous time
• t = smoothing parameter for the kernel
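A minimal sketch (assuming numpy/scipy) of the continuous-time heat kernel on a graph, K = exp(−βL), with β the smoothing parameter referred to above:

import numpy as np
from scipy.linalg import expm

def heat_kernel(S, beta=1.0):
    """Continuous-time heat / diffusion kernel on a graph: K = expm(-beta * L)."""
    D = S.sum(axis=1)
    Dm12 = 1.0 / np.sqrt(D)
    L = np.eye(len(D)) - Dm12[:, None] * S * Dm12[None, :]   # normalized Laplacian
    return expm(-beta * L)    # symmetric positive semi-definite, hence a valid kernel matrix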
Generalized Heat Kernel



(Smola & Kondor 03)
Theorem The only linear, permutation invariant mappings
S  T(S)2 Rn £ n are of the form
 S +  D +  V with V =  Di
Idea:
1. choose a regularization norm ||f||2 = <f, Qf>
2. with Q = q(L)
3. define
 where
Theorem
 <f,Qf’> defines a reproducing kernel Hilbert space
(RKHS)
 the kernel is
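The kernel formula itself did not survive the transcript. One standard reading of the construction above (stated here as an assumption, not as the slide's exact statement) is that the kernel is the pseudo-inverse of Q = q(L), computed on the spectrum of L; a numpy sketch:

import numpy as np

def generalized_graph_kernel(L, q):
    """Kernel induced by the regularization norm <f, q(L) f>: K = pseudo-inverse of q(L)."""
    lam, U = np.linalg.eigh(L)                     # L is the symmetric normalized Laplacian
    qlam = q(lam)
    inv = np.where(qlam > 1e-12, 1.0 / qlam, 0.0)  # invert q on the spectrum (pseudo-inverse)
    return (U * inv) @ U.T

# example spectral transforms q (for some constants sigma2, beta):
#   regularized Laplacian:   q = lambda lam: 1.0 + sigma2 * lam
#   diffusion (heat) kernel: q = lambda lam: np.exp(beta * lam)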
Overview

Graph data

The problem

Clustering

Semisupervised learning

Kernels on graphs

Other and future directions
Other aspects and future directions

Computation

Selecting number of clusters K

Obtaining / Learning the similarities Sij

Other tasks
 ranking, influence, communication

Incorporating
 constraints (prior knowledge)
 statistical models
 vector data

Directed graphs/ asymmetric S matrix
Computation

Algorithms are polynomial but intensive
 all eigenvectors: ~ n^3
 K eigenvectors: ~ nK × (number of iterations)
 SVM solver
• quadratic optimization problem


Numerical stability
Good: many graphs are sparse
 saves memory and computation
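A sketch of how sparsity is exploited in practice (assuming scipy): the K leading eigenvectors are obtained with an iterative sparse eigensolver on the symmetrized matrix D^{-1/2} S D^{-1/2}, which has the same eigenvalues as P:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def top_k_eigenvectors(S_sparse, K):
    """K leading eigenvalues/eigenvectors of P = D^{-1} S for a sparse similarity matrix S."""
    D = np.asarray(S_sparse.sum(axis=1)).ravel()
    Dm12 = sp.diags(1.0 / np.sqrt(D))
    M = Dm12 @ S_sparse @ Dm12                 # D^{-1/2} S D^{-1/2}: symmetric, same spectrum as P
    evals, evecs = eigsh(M, k=K, which='LA')   # iterative solver: a few matrix-vector products per step
    return evals, Dm12 @ evecs                 # map back to eigenvectors of P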
Perfect (P,C) pair
[Figure: nodes A, B, C and clusters C1, C2, with node-level transition probabilities PBA, PAC, PCB and aggregated cluster-to-cluster transitions R12, R21]
The “chain” over clusters is generally not Markov
 i.e., knowing past states gives information about the future
Definition: (P, C) is a perfect pair iff the aggregated chain is Markov
The spectral mapping
If (P, C) perfect
v1, v2,… vK first K eigenvectors of P
[Figure: plots of the first eigenvectors v1, v2, v3; these eigenvectors are called piecewise constant (PC). Scatterplot: the spectral mapping, data as elements of (v2, v3)]

The “classification error” distance
 computed by the maximal bipartite matching algorithm
between clusters
[Figure: the confusion matrix between the two clusterings, with entries Dkk' for clusters k and k'; the classification error is computed from it]
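A minimal sketch of this distance (the function name is mine; scipy's linear_sum_assignment plays the role of the maximal bipartite matching algorithm, and c1, c2 are integer cluster labels in 0..K-1):

import numpy as np
from scipy.optimize import linear_sum_assignment

def classification_error_distance(c1, c2, K):
    """Fraction of nodes misclassified under the best one-to-one matching of the clusters of c1 to those of c2."""
    n = len(c1)
    confusion = np.zeros((K, K))                     # confusion[k, k'] = # nodes with c1 = k and c2 = k'
    for a, b in zip(c1, c2):
        confusion[a, b] += 1
    rows, cols = linear_sum_assignment(-confusion)   # maximize total agreement over cluster matchings
    return 1.0 - confusion[rows, cols].sum() / n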