Purnamrita Sarkar (Carnegie Mellon)
Andrew W. Moore (Google, Inc.)
Which nodes are most similar to node i?
[Figure: an example graph. Paper #1 and Paper #2 are connected by a paper-cites-paper edge, and paper-has-word edges link them to words such as "maximum", "margin", "classification", "large", "scale", and "SVM".]
• Friend suggestion in Facebook
• Keyword-specific search in DBLP
• Random walk based measures
  - Personalized pagerank
  - Hitting and commute times
  - ...
• Intuitive measures of similarity, successfully used for many applications
• Possible query types:
  - Find the k most relevant papers about “support vector machines”
  - Queries can be arbitrary
• Computing these measures at query time is still an active area of research.
• Most algorithms¹ typically examine local neighborhoods around the query node
  - High-degree nodes make them slow.
• When the graph is too large for memory
  - Streaming algorithms² require multiple passes over the entire dataset.
• We want an external-memory framework that supports
  - Arbitrary queries
  - A broad range of random walk based measures
1. Berkhin 2006; Andersen et al. 2006; Chakrabarti 2007; Sarkar & Moore 2007.
2. Das Sarma et al. 2008.
• Introduction to some measures
• High degree nodes
• Disk-resident large graphs
• Results
• Personalized Pagerank (PPV)
  - Start at node i
  - At any step, reset to node i with probability α
  - PPV(i,·) is the stationary distribution of this process
• Discounted Hitting Time
  - Start at node i
  - At any step, stop if you hit j, or stop with probability α
  - h_α(i,j) is the expected time to stop
(A small simulation sketch of both measures follows.)
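To make the two definitions concrete, here is a minimal Monte Carlo sketch (not part of the talk): it assumes an in-memory adjacency dict adj and simply simulates α-terminated walks, which is the simplest, not the fastest, way to estimate either quantity. The convention for when the stopping coin is flipped may differ in detail from the paper's exact definition.

import random
from collections import Counter

def estimate_ppv_and_hitting(adj, i, j, alpha, num_walks=10000, rng=random):
    # adj: dict node -> list of neighbors (illustrative in-memory graph).
    end_counts = Counter()   # where the alpha-terminated walk ends      -> PPV(i, .)
    total_steps = 0          # steps until stopping at j or by the coin  -> h_alpha(i, j)
    for _ in range(num_walks):
        # PPV walk: at every step, stop with probability alpha, else move.
        node = i
        while rng.random() > alpha and adj[node]:
            node = rng.choice(adj[node])
        end_counts[node] += 1
        # Discounted-hitting-time walk: stop on reaching j, or with probability alpha.
        node, steps = i, 0
        while node != j and rng.random() > alpha and adj[node]:
            node = rng.choice(adj[node])
            steps += 1
        total_steps += steps
    ppv = {v: c / num_walks for v, c in end_counts.items()}
    return ppv, total_steps / num_walks

# Example on a hypothetical toy graph:
# ppv, h = estimate_ppv_and_hitting({0: [1], 1: [0, 2], 2: [1]}, i=0, j=2, alpha=0.1)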
• Introduction to some measures
• High degree nodes
- Effect on personalized pagerank
- Effect on discounted hitting times
• Disk-resident large graphs
• Results
• Effect of high-degree nodes on random walks
  - High-degree nodes can blow up the neighborhood size.
  - Bad for computational efficiency.
• Real-world graphs have power-law degree distributions
  - A very small number of high-degree nodes
  - But they are easily reachable because of the small-world property
• Main idea:
  - When a random walk hits a high-degree node, only a tiny fraction of the probability mass reaches any single neighbor.
[Figure: at step t a node of degree 1000 carries probability mass p; at step t+1 each neighbor receives only p/1000.]
• Stop the random walk when it hits a high-degree node
  ⇒ Turn the high-degree nodes into sink nodes.
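As an illustration of this preprocessing step (a sketch, not the paper's code), the following assumes the graph is a dict of out-neighbor lists and empties the adjacency list of every node whose degree exceeds a threshold, so a walk that reaches such a node stops there:

def make_high_degree_nodes_sinks(adj, max_degree):
    # adj: dict node -> list of neighbors. Nodes with degree > max_degree become
    # sinks: they keep their incoming edges but lose their outgoing ones, so a
    # random walk stops as soon as it hits them.
    sinks = {u for u, nbrs in adj.items() if len(nbrs) > max_degree}
    return {u: ([] if u in sinks else list(nbrs)) for u, nbrs in adj.items()}, sinks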
• We are computing personalized pagerank from node i (undirected graphs)
• If we make node s into a sink
  - PPV(i,j) will decrease. By how much?
  - Can prove: the contribution through s is
    Pr(hitting s from i) × PPV(s,j)
  - Is PPV(s,j) small if s has huge degree?
• Can show that the error at any node j is ≤ d_j / d_s
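One way to see why PPV(s,j) must be small when s has a huge degree, sketched here for an unweighted undirected graph using the reversibility of the simple random walk (the paper's own proof may differ in detail):

\[ d_s\,P^t(s,j) \;=\; d_j\,P^t(j,s) \qquad \text{(reversibility of the walk on an undirected graph),} \]
\[ \mathrm{PPV}(s,j) \;=\; \alpha\sum_{t\ge 0}(1-\alpha)^t P^t(s,j) \;=\; \frac{d_j}{d_s}\,\alpha\sum_{t\ge 0}(1-\alpha)^t P^t(j,s) \;=\; \frac{d_j}{d_s}\,\mathrm{PPV}(j,s) \;\le\; \frac{d_j}{d_s}. \]

Combined with the decomposition above, the error introduced at node j by making s a sink is at most Pr(hitting s from i) · PPV(s,j) ≤ d_j / d_s.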
• Discounted hitting times: hitting times with an α probability of stopping at any step.
• Main intuition:
  - PPV(i,j) = Pr_α(hitting j from i) × PPV(j,j)
  - We show: h_α(i,j) = (1/α) · (1 − PPV(i,j)/PPV(j,j))
  - Individual popularity is normalized out.
• Small effect on PPV ⇒ small effect on the discounted hitting time.
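A compact derivation of the displayed identity (a sketch, under one natural convention for when the α-coin fires; the talk's own proof may differ): let τ_j be the first time the walk from i reaches j, and let G be the independent geometric step at which the α-coin first fires, so h_α(i,j) = E[min(τ_j, G)].

\[ \alpha\,h_\alpha(i,j) \;=\; \alpha\sum_{s\ge 1}\Pr(\tau_j\ge s)\,(1-\alpha)^{s-1} \;=\; \sum_{s\ge 1}\Pr(G=s)\,\Pr(\tau_j\ge s) \;=\; \Pr(G\le\tau_j) \;=\; 1-\Pr\nolimits_\alpha(\text{hitting } j \text{ from } i). \]

By the Markov property, \( \mathrm{PPV}(i,j) = \Pr_\alpha(\text{hitting } j \text{ from } i)\cdot \mathrm{PPV}(j,j) \); substituting gives \( h_\alpha(i,j) = \tfrac{1}{\alpha}\bigl(1 - \mathrm{PPV}(i,j)/\mathrm{PPV}(j,j)\bigr) \).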
• Introduction to some measures
• High degree nodes
• Disk-resident large graphs
• Results
• Similar nodes should be placed nearby on disk
• Cluster the graph into page-size chunks
  ⇒ a random walk will stay mostly inside a good cluster
  ⇒ less computational overhead
• A real example
[Figure: a page-size cluster from a co-authorship graph spanning Robotics and Machine Learning and Statistics. Grey nodes are inside the cluster; authors shown include howie_choset, david_apfelbaum, john_langford, kurt_k, michael_krell, kamal_nigam, michael_beetz, larry_wasserman, thomas_hoffmann, tom_m_mitchell, and daurel_.]
• A random walk mostly stays inside a good cluster
[Figure: top 7 nodes by personalized pagerank from Sebastian Thrun: Wolfram Burgard, Dieter Fox, Mark Craven, Kamal Nigam, Dirk Schulz, Armin Cremers, Tom Mitchell. Grey nodes are inside the cluster.]
1. Load the cluster in memory.
2. Start the random walk.
• Can also maintain an LRU buffer that keeps several clusters in memory.
• A page-fault is recorded every time the walk leaves the buffered clusters, i.e., whenever a new page has to be loaded.
• Quality of a cluster: the ratio of cross edges to the total number of edges, which determines the number of page-faults on average.
⇒ Can we do better than sampling?
⇒ How to cluster an external-memory graph?
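A minimal sketch of the sampling scheme just described (the on-disk layout is emulated here by in-memory dicts adj and cluster_of, and a "page" is a whole cluster; these names are illustrative, not the paper's): it runs one α-terminated walk and counts a page-fault whenever the walk steps into a cluster that is not in the LRU buffer.

import random
from collections import OrderedDict

def sample_walk_counting_pagefaults(adj, cluster_of, start, alpha, buffer_size, rng=random):
    # adj: node -> list of neighbors; cluster_of: node -> cluster (page) id.
    lru = OrderedDict()            # cluster id -> True, in least-recently-used order
    faults = 0
    def touch(cid):
        nonlocal faults
        if cid in lru:
            lru.move_to_end(cid)   # cluster already buffered: no disk access
        else:
            faults += 1            # page-fault: the cluster has to be loaded
            lru[cid] = True
            if len(lru) > buffer_size:
                lru.popitem(last=False)   # evict the least recently used cluster
    node = start
    touch(cluster_of[node])
    while rng.random() > alpha and adj[node]:
        node = rng.choice(adj[node])
        touch(cluster_of[node])    # faults happen only when the walk leaves the buffered clusters
    return node, faults

Repeating such walks from the query node and histogramming the endpoints gives the sampling estimate of PPV; a fault counter of this kind is what the page-fault numbers in the results refer to.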
• Only compute PPV on the current cluster.
• No information from the rest of the graph.
• Poor approximation.
[Figure: query node j inside its cluster/neighborhood NB_j; the probability mass outside is unknown (?).]
• Upper and lower bounds on PPV(i,j) for i in NB_j
• Add new clusters when you expand.
• Maintain a global upper bound ub for nodes outside NB_j
• Stop when ub ≤ β ⇒ all nodes outside have small PPV
[Figure: NB_j is expanded around j; all nodes outside share the single upper bound ub.]
• Many fewer page-faults than sampling!
• We can also compute the hitting time to node j using this algorithm.
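The following is a simplified sketch of this bounded-expansion idea, not the paper's exact bound computation. It uses the recurrence PPV(i,j) = (α if i = j else 0) + (1 − α) · average of PPV(k,j) over the neighbors k of i: on the loaded neighborhood the recurrence is iterated with every outside node bracketed between 0 and a single shared upper bound, and the shared bound is tightened using the fact that a walk starting outside must survive one step and first enter the loaded set at a boundary node. All names (adj, loaded, ppv_bounds_to_j) are illustrative.

def ppv_bounds_to_j(adj, loaded, j, alpha, iters=100):
    # adj: node -> list of neighbors (undirected); loaded: set of nodes in memory (j in loaded).
    # Returns lower/upper bounds on PPV(i, j) for i in `loaded`, plus one upper
    # bound ub_out that is valid for every node outside `loaded`.
    lb = {i: 0.0 for i in loaded}
    ub = {i: 1.0 for i in loaded}
    ub_out = 1.0
    boundary = {i for i in loaded if any(k not in loaded for k in adj[i])}
    for _ in range(iters):
        new_lb, new_ub = {}, {}
        for i in loaded:
            base = alpha if i == j else 0.0
            d = len(adj[i]) or 1
            lo = sum(lb[k] if k in loaded else 0.0 for k in adj[i])      # outside mass >= 0
            hi = sum(ub[k] if k in loaded else ub_out for k in adj[i])   # outside mass <= ub_out
            new_lb[i] = base + (1 - alpha) * lo / d
            new_ub[i] = base + (1 - alpha) * hi / d
        lb, ub = new_lb, new_ub
        # A walk from outside must survive one step and first enter `loaded` at a boundary node.
        ub_out = (1 - alpha) * max((ub[b] for b in boundary), default=0.0)
    return lb, ub, ub_out

If ub_out ≤ β, every node outside the loaded set has PPV(i,j) ≤ β and cannot be among the answers with PPV ≥ β; otherwise the caller loads the clusters adjacent to the boundary and repeats.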
• Pick a measure for clustering
  - Personalized pagerank has been shown to yield good clustering¹
• Compute PPV from a set of A anchor nodes, and assign each node to its closest anchor (a one-line sketch of the assignment follows).
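As a small illustration of the assignment step (the names are illustrative, and the PPV values themselves are produced by RWDISK, described next):

def assign_to_closest_anchor(ppv_from_anchor, nodes):
    # ppv_from_anchor: anchor -> {node: PPV(anchor, node)}.  Each node is assigned
    # to the anchor whose personalized pagerank gives it the most mass.
    return {v: max(ppv_from_anchor, key=lambda a: ppv_from_anchor[a].get(v, 0.0))
            for v in nodes}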
• How to compute it on disk?
  - Personalized pagerank on disk
  - Nodes/edges do not fit in memory: no random access
  ⇒ RWDISK
1. R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In FOCS '06.
• Compute personalized pagerank using power iterations
  - Each iteration = one matrix-vector multiplication
  - Can compute by join operations between two lexicographically sorted files.
• Intermediate files can be large
  ⇒ Round the small probabilities to zero at any step.¹
  ⇒ Has bounded error, but brings down the file size from O(n²) to O(|E|)
1. Spielman and Teng 2004; Sarlós et al. 2006.
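A small in-memory sketch of one such iteration (it stands in for the file-based version: both inputs play the role of lexicographically sorted files, and the function name and eps threshold are illustrative, not RWDISK's actual interface):

from itertools import groupby
from operator import itemgetter

def one_power_iteration(edges, x_t, eps):
    # edges: (src, dst, prob) triples sorted by src; x_t: (node, value) pairs sorted
    # by node.  Returns x_{t+1} = x_t * P as sorted (node, value) pairs, with values
    # below eps rounded to zero to keep the intermediate "file" at O(|E|) size.
    contributions = []                                  # the intermediate file
    edge_groups = groupby(edges, key=itemgetter(0))
    src, group = next(edge_groups, (None, None))
    for node, value in x_t:                             # single synchronized pass: a sorted join
        while src is not None and src < node:
            src, group = next(edge_groups, (None, None))
        if src == node:
            for _, dst, prob in group:
                contributions.append((dst, value * prob))
            src, group = next(edge_groups, (None, None))
    contributions.sort(key=itemgetter(0))               # an external sort in the real setting
    x_next = []
    for dst, grp in groupby(contributions, key=itemgetter(0)):
        total = sum(v for _, v in grp)
        if total >= eps:                                # round small probabilities to zero
            x_next.append((dst, total))
    return x_next

The actual RWDISK recurrence also folds in the α restart at the anchor nodes and accumulates the iterates into the PPV estimate; only the sorted-file join and the rounding step are shown here.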
• Introduction to some measures
• High degree nodes
• Disk-resident large graphs
• Results
• Turning high-degree nodes into sinks
• Deterministic algorithm vs. sampling
• RWDISK on external-memory graphs
  - Yields better clusters than METIS with a much smaller memory requirement (will skip for now).
• Citeseer subgraph: co-authorship graphs
• DBLP: paper-word-author graphs
• LiveJournal: online friendship network
Effect of turning high-degree nodes into sinks (accuracy and page-faults):

Dataset       Min. degree of sink nodes   Accuracy   Page-faults
Citeseer      None                        0.74       69
Citeseer      100                         0.74       67
DBLP          None                        0.1        1881
DBLP          1000                        0.58       231     (6 times better, 8 times fewer page-faults)
LiveJournal   None                        0.2        1502
LiveJournal   100                         0.43       255     (2 times better, 6 times fewer page-faults)
Mean page-faults of the deterministic algorithm:

Dataset       Mean page-faults
Citeseer      6     (10 times less than sampling)
DBLP          54    (4 times less than sampling)
LiveJournal   64    (4 times less than sampling)
• Turning high-degree nodes into sinks
  - Has bounded effect on personalized pagerank and hitting time
  - Significantly improves the running time of RWDISK (3-4 times)
  - Improves the number of page-faults when sampling a random walk
  - Improves link prediction accuracy
• Search algorithms on a clustered framework
  - Sampling is easy to implement and can be applied widely
  - Deterministic algorithm
    - Guaranteed not to miss a potential nearest neighbor
    - Improves the number of page-faults significantly over sampling
• RWDISK
  - Fully external-memory algorithm for clustering a graph
Thanks!
• Personalized Pagerank
  - Start at node i
  - At any step, reset to node i with probability α
  - Stationary distribution of this process:
    PPV(i,j) = α Σ_{t≥0} (1−α)^t P^t(i,j),
    the PPV from i to j, where P^t(i,j) = Pr(X_t = j | X_0 = i).
• Maintain ub(NB_j) for all nodes outside NB_j
• Stop when ub(NB_j) ≤ α
• Guaranteed to return all nodes with PPV(i,j) ≥ α
[Figure: NB_j is expanded around j; all nodes outside share the upper bound ub(NB_j).]
• Many fewer page-faults than sampling!
• We can also compute PPV to node j using this algorithm.
RWDISK running times with and without turning high-degree nodes into sinks:

Dataset       Min. degree of a sink node   Number of sinks   Time
DBLP          None                         0                 ≥ 2.5 days
DBLP          1000                         900               11 hours
LiveJournal   1000                         950               60 hours
LiveJournal   100                          134K              17 hours

Using sinks makes RWDISK roughly 3-4 times faster.
[Figure: histograms of the expected number of page-faults if a random walk stepped outside a cluster, for a randomly picked node. Left to right, the panels are for Citeseer, DBLP, and LiveJournal.]