Practical Graph Mining with R
Graph-based Proximity Measures
Nagiza F. Samatova
William Hendrix
John Jenkins
Kanchana Padmanabhan
Arpan Chakraborty
Department of Computer Science
North Carolina State University
Outline
• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor
Similarity and Dissimilarity
• Similarity
– Numerical measure of how alike two data objects are.
– Higher when objects are more alike.
– Often falls in the range [0, 1].
– Examples: cosine, Jaccard, Tanimoto.
• Dissimilarity
– Numerical measure of how different two data objects are.
– Lower when objects are more alike.
– Minimum dissimilarity is often 0.
– Upper limit varies.
• Proximity refers to either a similarity or a dissimilarity.
Src: “Introduction to Data Mining” by Vipin Kumar et al
Distance Metric
• A distance d(p, q) between two points p and q is a dissimilarity measure if it satisfies:
1. Positive definiteness: d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q.
2. Symmetry: d(p, q) = d(q, p) for all p and q.
3. Triangle inequality: d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
• Examples:
– Euclidean distance
– Minkowski distance
– Mahalanobis distance
Src: “Introduction to Data Mining” by Vipin Kumar et al
Is this a distance metric?

Let $p = (p_1, p_2, \ldots, p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, \ldots, q_d) \in \mathbb{R}^d$.

$d(p, q) = \max_{1 \le j \le d}(p_j, q_j)$: not positive definite.

$d(p, q) = \max_{1 \le j \le d}(p_j - q_j)$: not symmetric.

$d(p, q) = \min_{1 \le j \le d}|p_j - q_j|$: violates the triangle inequality.

$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$: a distance metric (Euclidean distance).
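A quick numerical check makes the last failure concrete. This is a small R sketch (the helper name dmin is ours, not from any package) showing that the coordinate-wise minimum violates the triangle inequality:

dmin <- function(p, q) min(abs(p - q))   # d(p, q) = min_j |p_j - q_j|

p <- c(0, 0); q <- c(0, 2); r <- c(2, 2)
dmin(p, r)              # 2
dmin(p, q) + dmin(q, r) # 0 + 0 = 0, so d(p, r) > d(p, q) + d(q, r)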
Distance: Euclidean, Minkowski, Mahalanobis

Let $p = (p_1, p_2, \ldots, p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, \ldots, q_d) \in \mathbb{R}^d$.

Euclidean: $d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$

Minkowski: $d_r(p, q) = \left(\sum_{j=1}^{d} |p_j - q_j|^r\right)^{1/r}$
– r = 1: city block (Manhattan) distance, the L1-norm
– r = 2: Euclidean distance, the L2-norm

Mahalanobis: $d(p, q) = (p - q)\,\Sigma^{-1}(p - q)^T$, where $\Sigma$ is the covariance matrix of the data.
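As a sketch of these three distances in base R (dist() and mahalanobis() are from the stats package; the points and covariance matrix below are invented for illustration):

p <- c(0, 2); q <- c(2, 0)

# Euclidean and Minkowski (r = 1) distances between the rows of a matrix
dist(rbind(p, q), method = "euclidean")         # 2.828
dist(rbind(p, q), method = "minkowski", p = 1)  # 4

# mahalanobis() returns the *squared* distance, so take the square root;
# S is an assumed covariance matrix
S <- matrix(c(1, 0.5, 0.5, 2), 2, 2)
sqrt(mahalanobis(p, center = q, cov = S))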
Euclidean Distance

$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$

Standardization is necessary if attribute scales differ. Ex: $p = (\text{age}, \text{salary})$.

For $p = (p_1, p_2, \ldots, p_d) \in \mathbb{R}^d$:

Mean of attributes: $\bar{p} = \frac{1}{d}\sum_{k=1}^{d} p_k$

Standard deviation of attributes: $s_p = \sqrt{\frac{1}{d-1}\sum_{k=1}^{d}(p_k - \bar{p})^2}$

Standardized/normalized vector:
$p_{new} = \left(\frac{p_1 - \bar{p}}{s_p}, \frac{p_2 - \bar{p}}{s_p}, \ldots, \frac{p_d - \bar{p}}{s_p}\right)$, so that $\bar{p}_{new} = 0$ and $s_{p_{new}} = 1$.
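In practice each attribute is standardized across all data objects (down the columns of the data matrix), which is exactly what base R's scale() does; a small sketch with invented (age, salary) data:

# Rows are data objects, columns are the attributes (age, salary)
P <- rbind(c(25, 50000),
           c(40, 62000),
           c(31, 48000))
P_std <- scale(P)     # center each column to mean 0 and scale to sd 1
colMeans(P_std)       # ~0 for both attributes
apply(P_std, 2, sd)   # 1 for both attributes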
Distance Matrix

$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$

• P = as.matrix(read.table(file = "points.dat"))
• D = dist(P[, 2:3], method = "euclidean")
• L1 = dist(P[, 2:3], method = "minkowski", p = 1)
• help(dist)

Input Data Table: P (file name: points.dat)

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Output Distance Matrix: D

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

Src: “Introduction to Data Mining” by Vipin Kumar et al
Covariance of Two Vectors, cov(p, q)

Let $p = (p_1, p_2, \ldots, p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, \ldots, q_d) \in \mathbb{R}^d$, with mean of attributes $\bar{p} = \frac{1}{d}\sum_{k=1}^{d} p_k$ (and $\bar{q}$ defined likewise).

One definition:
$\mathrm{cov}(p, q) = s_{pq} = \frac{1}{d-1}\sum_{k=1}^{d}(p_k - \bar{p})(q_k - \bar{q})$

Or a more general definition:
$\mathrm{cov}(p, q) = E[(p - E(p))(q - E(q))^T]$, where E denotes the expected value of a random variable.
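Base R's cov() implements the first (sample) definition directly; the vectors here are arbitrary:

p <- c(1, 3, 5, 7)
q <- c(2, 2, 6, 4)
cov(p, q)   # sample covariance, with the d - 1 denominator
# The same value computed by hand from the first definition:
sum((p - mean(p)) * (q - mean(q))) / (length(p) - 1)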
Covariance, or Dispersion, Matrix

N points in d-dimensional space:
$P_1 = (p_{11}, p_{12}, \ldots, p_{1d}) \in \mathbb{R}^d, \; \ldots, \; P_N = (p_{N1}, p_{N2}, \ldots, p_{Nd}) \in \mathbb{R}^d$

The covariance, or dispersion, matrix:

$$\Sigma(P_1, P_2, \ldots, P_N) = \begin{pmatrix} \mathrm{cov}(P_1, P_1) & \mathrm{cov}(P_1, P_2) & \cdots & \mathrm{cov}(P_1, P_N) \\ \mathrm{cov}(P_2, P_1) & \mathrm{cov}(P_2, P_2) & \cdots & \mathrm{cov}(P_2, P_N) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(P_N, P_1) & \mathrm{cov}(P_N, P_2) & \cdots & \mathrm{cov}(P_N, P_N) \end{pmatrix}$$

The inverse, $\Sigma^{-1}$, is called the concentration matrix or precision matrix.
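A sketch in base R: cov() works column-wise, so transposing a points-by-attributes matrix yields the N × N dispersion matrix defined above, and solve() gives the precision matrix when Σ is nonsingular (the data values are invented):

# Rows are the N = 3 points P1..P3, columns are the d = 4 attributes
P <- rbind(c(1, 2, 3, 4),
           c(2, 1, 4, 3),
           c(5, 0, 2, 7))
Sigma <- cov(t(P))   # transpose so each point becomes a column: N x N matrix
solve(Sigma)         # concentration (precision) matrix, if Sigma is nonsingular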
Common Properties of a Similarity
• Similarities also have some well-known properties:
– s(p, q) = 1 (or maximum similarity) only if p = q.
– s(p, q) = s(q, p) for all p and q (symmetry),
where s(p, q) is the similarity between points (data objects) p and q.
Src: “Introduction to Data Mining” by Vipin Kumar et al
Similarity Between Binary Vectors
• Suppose p and q have only binary attributes
• Compute similarities using the following quantities
– M01 = the number of attributes where p was 0 and q was 1
– M10 = the number of attributes where p was 1 and q was 0
– M00 = the number of attributes where p was 0 and q was 0
– M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients:
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of 11 matches / number of not-both-zero attribute values
= M11 / (M01 + M10 + M11)
Src: “Introduction to Data Mining” by Vipin Kumar et al
SMC versus Jaccard: Example
p= 1000000000
q= 0000001001
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00)
= (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
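Both coefficients are one-liners in R (the helper names smc and jaccard are ours); this sketch reproduces the numbers above:

smc <- function(p, q) mean(p == q)   # (M11 + M00) / total number of attributes
jaccard <- function(p, q) {
  m11 <- sum(p == 1 & q == 1)
  m01 <- sum(p == 0 & q == 1)
  m10 <- sum(p == 1 & q == 0)
  m11 / (m01 + m10 + m11)
}
p <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
smc(p, q)       # 0.7
jaccard(p, q)   # 0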
Cosine Similarity
• If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where
· indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d.
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
Src: “Introduction to Data Mining” by Vipin Kumar et al
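The same computation as a short R sketch (cosine_sim is our helper name):

cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
d1 <- c(3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
cosine_sim(d1, d2)   # 0.315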
Extended Jaccard Coefficient (Tanimoto)
• Variation of Jaccard for continuous or count attributes:
$EJ(p, q) = \frac{p \cdot q}{\|p\|^2 + \|q\|^2 - p \cdot q}$
– Reduces to the Jaccard coefficient for binary attributes.
Src: “Introduction to Data Mining” by Vipin Kumar et al
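A direct transcription of that formula into R (tanimoto is our helper name):

tanimoto <- function(p, q) {
  pq <- sum(p * q)
  pq / (sum(p^2) + sum(q^2) - pq)
}
# On binary vectors it reduces to the Jaccard coefficient:
tanimoto(c(1, 0, 0, 1), c(0, 1, 0, 1))   # 1/3 = M11 / (M01 + M10 + M11)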
Correlation (Pearson Correlation)
• Correlation measures the linear relationship between objects.
• To compute correlation, we standardize the data objects p and q, and then take their dot product:
$p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)$
$q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)$
$\mathrm{correlation}(p, q) = p' \cdot q'$
Src: “Introduction to Data Mining” by Vipin Kumar et al
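R's cor() computes this directly; the hand-rolled version below follows the slide's recipe, with the extra 1/(d - 1) factor that makes the dot product an exact Pearson correlation (the sample vectors are invented):

p <- c(3, 6, 0, 3, 6)
q <- c(1, 2, 0, 1, 2)   # q is p / 3, so perfectly linearly related
cor(p, q)               # 1

p_std <- (p - mean(p)) / sd(p)
q_std <- (q - mean(q)) / sd(q)
sum(p_std * q_std) / (length(p) - 1)   # same value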
Visually Evaluating Correlation
(Figure: scatter plots showing correlations ranging from -1 to 1.)
Src: “Introduction to Data Mining” by Vipin Kumar et al
General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an overall similarity is needed.
Src: “Introduction to Data Mining” by Vipin Kumar et al
Using Weights to Combine Similarities
• We may not want to treat all attributes the same.
– Use weights wk that are between 0 and 1 and sum to 1, as in the sketch below.
Src: “Introduction to Data Mining” by Vipin Kumar et al
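One standard form of the weighted combination (our reconstruction, following the Tan, Steinbach, and Kumar text) is $\mathrm{similarity}(p, q) = \frac{\sum_{k=1}^{n} w_k\,\delta_k\,s_k(p, q)}{\sum_{k=1}^{n} w_k\,\delta_k}$, where $s_k$ is the similarity on attribute k and $\delta_k$ is 1 when the k-th comparison is valid and 0 otherwise. A minimal R sketch (all names invented here):

combine_sims <- function(s, w, delta = rep(1, length(s))) {
  # s: per-attribute similarities; w: weights; delta: validity indicators
  sum(w * delta * s) / sum(w * delta)
}
s <- c(0.8, 0.5, 1.0)   # similarities on three attributes
w <- c(0.5, 0.3, 0.2)   # weights between 0 and 1, summing to 1
combine_sims(s, w)      # 0.75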
Graph-Based Proximity Measures
In order to apply graph-based data mining techniques, such as classification and clustering, it is necessary to define proximity measures between data represented in graph form.
Within-graph proximity measures:
• Hyperlink-Induced Topic Search (HITS)
• The Neumann Kernel
• Shared Nearest Neighbor (SNN)
Outline
• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor
Neumann Kernels: Agenda
• Introduction
• Co-citation and Bibliographic Coupling
• Document and Term Correlation
• Diffusion/Decay Factors
• Relationship to HITS
• Strengths and Weaknesses
Neumann Kernels (NK)
• A generalization of HITS
• Input: an undirected or directed graph
• Output: a within-graph proximity measure
– Importance
– Relatedness
(Named after John von Neumann.)
NK: Citation Graph
(Figure: a directed citation graph on vertices n1 through n8.)
• Input: Graph
– Vertices n1…n8 are articles
– The graph is directed
– Edges indicate a citation
• A citation matrix C can be formed:
– If an edge exists between two vertices, the corresponding matrix cell is 1; otherwise it is 0.
NK: Co-citation Graph
(Figure: the co-citation graph derived from the citation graph above.)
• Co-citation graph: a graph in which two nodes are connected if they appear simultaneously in the reference list of a third node in the citation graph.
• In the graph above, n1 and n2 are connected because both are referenced by the same node, n5, in the citation graph.
• The co-citation matrix is $CC = C^T C$.
NK: Bibliographic Coupling Graph
(Figure: the bibliographic coupling graph derived from the citation graph above.)
• Bibliographic coupling graph: a graph in which two nodes are connected if they share one or more bibliographic references.
• In the graph above, n5 and n6 are connected because both reference the same node, n2, in the citation graph.
• The bibliographic coupling matrix is $C C^T$ (note the transpose on the right, in contrast to co-citation's $C^T C$).
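The two matrix products are easy to compare in R. The citation matrix below is a made-up four-article miniature, not the n1–n8 graph from the figures: article 3 cites articles 1 and 2, and article 4 cites article 2.

# Rows cite columns: C[i, j] = 1 means article i cites article j
C <- matrix(c(0, 0, 0, 0,
              0, 0, 0, 0,
              1, 1, 0, 0,
              0, 1, 0, 0), nrow = 4, byrow = TRUE)
t(C) %*% C   # co-citation: articles 1 and 2 are both cited by article 3
C %*% t(C)   # bibliographic coupling: articles 3 and 4 both cite article 2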
NK: Document and Term Correlation
Term-document matrix: a matrix in which the rows represent terms, the columns represent documents, and the entries represent a function of their relationship (e.g., the frequency of the given term in the document).
Example:
D1: “I like this book”
D2: “We wrote this book”
Term-Document Matrix X (rows are terms, columns are documents):

        D1  D2
I        1   0
like     1   0
this     1   1
book     1   1
we       0   1
wrote    0   1
NK: Document and Term Correlation (2)
Document correlation matrix: a matrix in which the rows and the columns represent documents, and the entries represent the semantic similarity between two documents.
Example:
D1: “I like this book”
D2: “We wrote this book”
Document Correlation Matrix $K = X^T X$
NK: Document and Term Correlation (3)
Term correlation matrix: a matrix in which the rows and the columns represent terms, and the entries represent the semantic similarity between two terms.
Example:
D1: “I like this book”
D2: “We wrote this book”
Term Correlation Matrix $T = X X^T$
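Putting the three slides together as a runnable R sketch (X is the term-document matrix shown above; the term matrix is named Tm to avoid masking R's built-in shorthand T):

terms <- c("I", "like", "this", "book", "we", "wrote")
X <- matrix(c(1, 0,
              1, 0,
              1, 1,
              1, 1,
              0, 1,
              0, 1), nrow = 6, byrow = TRUE,
            dimnames = list(terms, c("D1", "D2")))
K  <- t(X) %*% X   # 2 x 2 document correlation matrix
Tm <- X %*% t(X)   # 6 x 6 term correlation matrix
K                  # off-diagonal entries count the shared terms ("this", "book")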
Neumann Kernel Block Diagram
• Input: a graph
• Output: two matrices of dimensions n × n, called Kγ and Tγ
• Diffusion/decay factor γ: a tunable parameter that controls the balance between relatedness and importance
NK: Diffusion Factor - Equation & Effect
Neumann Kernel defines two matrices incorporating a
diffusion factor:
Simplifies with our
definitions of K and T
When
When
NK: Diffusion Factor - Terminology
(The example values below refer to the four-vertex graph A–D shown on the slide.)
• Indegree: the indegree δ⁻(v) of a vertex v is the number of edges leading to v. Example: δ⁻(B) = 1.
• Outdegree: the outdegree δ⁺(v) of a vertex v is the number of edges leading away from v. Example: δ⁺(A) = 3.
• Maximal indegree: the maximal indegree Δ⁻ of the graph is the maximum of the indegree counts over all vertices of the graph. Example: Δ⁻(G) = 2.
• Maximal outdegree: the maximal outdegree Δ⁺ of the graph is the maximum of the outdegree counts over all vertices of the graph. Example: Δ⁺(G) = 3.
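With igraph these quantities are one-liners. The edge list below is an assumed stand-in for the slide's four-vertex graph, chosen to reproduce the example values:

library(igraph)
# Assumed edges: A points to B, C, and D; B points to C
g <- graph_from_literal(A -+ B, A -+ C, A -+ D, B -+ C)
degree(g, v = "B", mode = "in")    # indegree of B: 1
degree(g, v = "A", mode = "out")   # outdegree of A: 3
max(degree(g, mode = "in"))        # maximal indegree of the graph: 2
max(degree(g, mode = "out"))       # maximal outdegree of the graph: 3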
NK: Diffusion Factor - Algorithm
NK: Choice of Diffusion Factor and its Effects on the Neumann Algorithm
• The Neumann Kernel outputs relatedness between documents and between terms when γ is small (at γ = 0 the output is exactly K and T).
• Similarly, when γ is larger, the kernel's output matches HITS.
Comparing NK, HITS, and Co-citation/Bibliographic Coupling
(Figure: the n1–n8 citation graph from before.)
The HITS authority ranking for this graph is:
n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8
Calculating the Neumann Kernel with γ = 0.207, the maximum possible value of γ in this case, gives the same ranking:
n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8
For higher values of γ, the Neumann Kernel converges to HITS.
Strengths and Weaknesses

Strengths:
• Generalization of HITS
• Merges relatedness and importance
• Useful in many graph applications

Weaknesses:
• Topic drift
• No penalty for loops in the adjacency matrix
Outline
• Defining Proximity Measures
• Neumann Kernels
• Shared Nearest Neighbor
Shared Nearest Neighbor (SNN)
• An indirect approach to similarity
• Uses a dynamic method of a k-Nearest Neighbor graph to determine the similarity between nodes
• If two vertices have more than k neighbors in common, they can be considered similar to one another even if a direct link does not exist
SNN - Agenda
• Understanding Proximity
• Proximity Graphs
• Shared Nearest Neighbor Graph
• SNN Algorithm
• Time Complexity
• R Code Example
• Outlier/Anomaly Detection
• Strengths
• Weaknesses
SNN – Understanding Proximity
• What makes a node a neighbor of another node is based on the definition of proximity.
• Definition: the closeness between a set of objects.
• Proximity can measure the extent to which two nodes belong to the same cluster.
• Proximity is a subtle notion whose definition can depend on the specific application.
SNN - Proximity Graphs
• A graph obtained by connecting two points, in a set of points, by an edge if the two points are, in some sense, close to each other.
SNN – Proximity Graphs (continued)
(Figure: various types of proximity graphs, with linear, radial, and cyclic examples.)
SNN – Proximity Graphs (continued)
Other types of proximity graphs (shown in the figure):
• Gabriel graph
• Nearest neighbor graph (Voronoi diagram)
• Minimum spanning tree
• Relative neighbor graph
SNN – Proximity Graphs (continued)
• Represents neighbor relationships between objects
• Can estimate the likelihood that a link will exist in the future, or is missing in the data for some reason
• Using a proximity graph increases the scale range over which good segmentations are possible
• Can be formulated with respect to many metrics
SNN – k-Nearest Neighbor (k-NN) Graph
• Connects each object to its k closest objects under the chosen proximity measure
• Forms the basis for the Shared Nearest Neighbor (SNN) within-graph proximity measure
• Has applications in cluster analysis and outlier detection
SNN – Shared Nearest Neighbor Graph
• An SNN graph is a special type of k-NN graph.
• If an edge exists between two vertices, then they both belong to each other's k-neighborhood.
(Figure) Each of the two black vertices, i and j, has eight nearest neighbors, including each other. Four of those nearest neighbors are shared (shown in red). Thus, the two black vertices are similar in the SNN graph when the parameter k = 4.
SNN – The Algorithm
Input: G: an undirected graph
Input: k: a natural number (required number of shared neighbors)
for i = 1 to N(G) do
  for j = i + 1 to N(G) do
    counter = 0
    for m = 1 to N(G) do
      if vertex i and vertex j both have an edge with vertex m then
        counter++
      end if
    end for
    if counter ≥ k then
      Connect an edge between vertex i and vertex j in the SNN graph
    end if
  end for
end for
return SNN graph
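A minimal R sketch of this algorithm, operating on a 0/1 adjacency matrix (the function name snn_graph is ours; the book's ProximityMeasure package provides its own SNN(), used on the next slide):

# Connect i and j in the SNN graph when they share at least k neighbors
snn_graph <- function(adj, k) {
  n <- nrow(adj)
  snn <- matrix(0, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      shared <- sum(adj[i, ] == 1 & adj[j, ] == 1)   # common neighbors of i and j
      if (shared >= k) snn[i, j] <- snn[j, i] <- 1
    }
  }
  snn
}

Applied to the adjacency matrix on the R example slide below, snn_graph(mat, 2) yields exactly the four edges A–D, B–D, B–E, and C–E reported there.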
SNN – Time Complexity
• Let n be the number of vertices of graph G.
• The loops over i and m each iterate once for every vertex (n times), while the loop over j iterates at most n - 1 times (O(n)).
• Cumulatively, this results in a total running time of O(n³).
SNN – R Code Example

library(igraph)
library(ProximityMeasure)
data = c(0, 1, 0, 0, 1, 0,
         1, 0, 1, 1, 1, 0,
         0, 1, 0, 1, 0, 0,
         0, 1, 1, 0, 1, 1,
         1, 1, 0, 1, 0, 0,
         0, 0, 0, 1, 0, 0)
mat = matrix(data, 6, 6)
G = graph.adjacency(mat, mode = c("directed"), weighted = NULL)
V(G)$label <- c('A', 'B', 'C', 'D', 'E', 'F')
tkplot(G)
SNN(mat, 2)
[0] A -- D
[1] B -- D
[2] B -- E
[3] C -- E

(Figure: the tkplot rendering of G with vertices A–F.)
SNN – Outlier/Anomaly Detection
• Outlier/Anomaly: something that deviates from what is standard, normal, or expected.
• Outlier/Anomaly Detection: detecting patterns in a given data set that do not conform to an established normal behavior.
(Figure: a scatter plot in which one point lies far from the remaining cluster of points.)
SNN - Strengths
• Ability to handle noise and outliers
• Ability to handle clusters of different sizes and shapes
• Very good at handling clusters of varying densities

SNN - Weaknesses
• Does not take into account the weight of the link between the nodes in a nearest neighbor graph
• Low similarity among nodes of the same cluster in a graph can cause it to find nearest neighbors that are not in the same cluster
Time Complexity Comparison

Algorithm                  Run Time
HITS                       O(k · n^2.376)
Neumann Kernel             O(n^2.376)
Shared Nearest Neighbor    O(n^3)

Conclusion: Neumann Kernel ≤ HITS < SNN