Purnamrita Sarkar (Carnegie Mellon)   Andrew W. Moore (Google, Inc.)

Which nodes are most similar to node i?
[Figure: an example graph in which Paper #1 and Paper #2 are connected by a paper-cites-paper edge and by paper-has-word edges to the words "maximum margin", "classification", "large scale", and "SVM".]
• Applications: friend suggestion in Facebook, keyword-specific search in DBLP.

• Random walk based measures
  - Personalized PageRank
  - Hitting and commute times
  - ...
• These are intuitive measures of similarity, successfully used for many applications.
• Possible query types:
  - Find the k most relevant papers about "support vector machines".
  - Queries can be arbitrary.
• Computing these measures at query time is still an active area of research.

• Most algorithms [1] typically examine local neighborhoods around the query node.
  - High-degree nodes make them slow.
• When the graph is too large for memory:
  - Streaming algorithms [2] require multiple passes over the entire dataset.
• We want an external-memory framework that supports arbitrary queries and is amenable to many random walk based measures.

[1] Berkhin 2006; Andersen et al. 2006; Chakrabarti 2007; Sarkar & Moore 2007
[2] Das Sarma et al. 2008

Outline
• Introduction to some measures
• High-degree nodes
• Disk-resident large graphs
• Results

• Personalized PageRank (PPV)
  - Start at node i.
  - At any step, reset to node i with probability α.
  - PPV(i,·) is the stationary distribution of this process.
• Discounted hitting time
  - Start at node i.
  - At any step, stop if you hit j, or stop with probability α.
  - h_α(i,j) is the expected time until the walk stops.

Outline
• Introduction to some measures
• High-degree nodes
  - Effect on personalized PageRank
  - Effect on discounted hitting times
• Disk-resident large graphs
• Results

Effect of high-degree nodes on random walks
• High-degree nodes can blow up neighborhood size.
• Bad for computational efficiency.
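Both measures above can be estimated by direct Monte Carlo simulation. A minimal sketch, assuming a dict-of-lists graph representation; the function names, toy graph, and parameter values are mine, not from the talk:

```python
import random
from collections import Counter

def personalized_pagerank_mc(graph, i, alpha=0.15, n_walks=20000, rng=None):
    """Estimate PPV(i, .) by simulating reset random walks from i.

    `graph` maps each node to a list of its neighbors.  The normalized
    visit counts of the walk-with-reset process converge to its
    stationary distribution, which is exactly PPV(i, .).
    """
    rng = rng or random.Random(0)
    visits = Counter()
    for _ in range(n_walks):
        node = i
        while True:
            visits[node] += 1
            if rng.random() < alpha:   # reset: this walk ends, the next starts at i
                break
            node = rng.choice(graph[node])
    total = sum(visits.values())
    return {v: c / total for v, c in visits.items()}

def discounted_hitting_time_mc(graph, i, j, alpha=0.15, n_walks=20000, rng=None):
    """Estimate h_alpha(i, j): expected number of steps until the walk
    either hits j or stops, where it stops with probability alpha per step."""
    rng = rng or random.Random(0)
    total_steps = 0
    for _ in range(n_walks):
        node, steps = i, 0
        while node != j and rng.random() >= alpha:
            node = rng.choice(graph[node])
            steps += 1
        total_steps += steps
    return total_steps / n_walks
```

On a toy graph the source node receives the largest PPV score, and nodes that are farther from the source have larger discounted hitting times.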
• Real-world graphs have power-law degree distributions:
  - a very small number of high-degree nodes,
  - which are nevertheless easily reachable because of the small-world property.

Main idea
• When a random walk hits a high-degree node, only a tiny fraction of the probability mass gets to each neighbor: if the node has degree 1000, mass p at time t becomes p/1000 per neighbor at time t+1.
• So stop the random walk when it hits a high-degree node: turn the high-degree nodes into sink nodes.

Undirected graphs
• We are computing personalized PageRank from node i.
• If we make node s into a sink, PPV(i,j) will decrease. By how much?
• Can prove: the contribution through s is Pr(hitting s from i) × PPV(s,j).
• Is PPV(s,j) small if s has huge degree? Yes: we can show the error at any node j is ≤ d_j / d_s.

• Discounted hitting times: hitting times with a probability α of stopping at any step.
• Main intuition: PPV(i,j) = Pr_α(hitting j from i) × PPV(j,j), and we show

      h_α(i,j) = (1/α) · (1 − PPV(i,j) / PPV(j,j))

  Individual popularity is normalized out, so a small effect on PPV means a small effect on h_α.

Outline
• Introduction to some measures
• High-degree nodes
• Disk-resident large graphs
• Results

• Similar nodes should be placed nearby on disk.
• Cluster the graph into page-size chunks: a random walk will stay mostly inside a good cluster, so there is less computational overhead.
[Figure: a real example of a co-authorship graph split into a Robotics cluster (howie_choset, david_apfelbaum, john_langford, kurt_k, michael_krell, kamal_nigam, michael_beetz) and a Machine Learning and Statistics cluster (larry_wasserman, thomas_hoffmann, tom_m_mitchell, daurel_). Grey nodes are inside the cluster.]
• A random walk mostly stays inside a good cluster: the top 7 nodes in personalized PageRank from Sebastian Thrun are Wolfram Burgard, Dieter Fox, Mark Craven, Kamal Nigam, Dirk Schulz, Armin Cremers, and Tom Mitchell. Grey nodes are inside the cluster.

Sampling on a clustered graph
1. Load the query node's cluster into memory.
2. Start the random walk.
• Can also maintain an LRU buffer to store recently used clusters in memory.
• Can we do better than sampling? And how do we cluster an external-memory graph?
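The sink transformation is easy to prototype in memory. A sketch under my own conventions (dict-of-lists graph; the degree threshold and the toy hub graph are illustrative, not from the talk): mass that reaches a sink stays there, and PPV at every other node can only decrease.

```python
from collections import defaultdict

def make_sinks(graph, max_degree):
    """Return a copy of `graph` where every node with degree >= max_degree
    loses its out-edges: a walk that reaches it stops there."""
    return {v: ([] if len(nbrs) >= max_degree else list(nbrs))
            for v, nbrs in graph.items()}

def ppv_with_sinks(graph, i, alpha=0.15, n_iters=300):
    """PPV(i,.) = alpha * sum_t (1-alpha)^t P^t(i,.), truncated at n_iters.
    Mass that reaches a sink (a node with no out-edges) stays put, so the
    vector still sums to ~1 and sinks absorb the walk's remaining mass."""
    cur = {i: 1.0}
    ppv = defaultdict(float)
    weight = alpha
    for _ in range(n_iters):
        for v, mass in cur.items():
            ppv[v] += weight * mass
        nxt = defaultdict(float)
        for v, mass in cur.items():
            nbrs = graph[v]
            if not nbrs:          # sink: the walk has stopped here
                nxt[v] += mass
            else:
                for u in nbrs:
                    nxt[u] += mass / len(nbrs)
        cur, weight = nxt, weight * (1 - alpha)
    return dict(ppv)
```

On a toy hub graph, making the degree-10 hub a sink leaves the nodes behind it unreached, increases the mass absorbed at the hub, and changes PPV at the remaining nodes by at most d_j/d_s.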
• A page fault is recorded every time the walk steps outside the current cluster and a new page is loaded.
• Quality of a cluster:
  - average number of page faults;
  - ratio of cross edges to the total number of edges.

Local computation?
• Only compute PPV on the current cluster.
• No information from the rest of the graph: a poor approximation.

Deterministic expansion
• Maintain upper and lower bounds on PPV(i,j) for nodes i in the neighborhood NB_j.
• Add new clusters when you expand.
• Maintain a global upper bound ub for nodes outside NB_j.
• Stop when ub ≤ β: all nodes outside are then guaranteed to have small PPV.
• Many fewer page faults than sampling!
• We can also compute the hitting time to node j using this algorithm.

How to cluster an external-memory graph: RWDISK
• Pick a measure for clustering: personalized PageRank has been shown to yield good clusterings [R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. FOCS '06].
• Compute PPV from a set of A anchor nodes, and assign each node to its closest anchor.
• How to compute it on disk? Nodes and edges do not fit in memory, so there is no random access.
• Compute personalized PageRank using power iterations.
  - Each iteration is one matrix-vector multiplication.
  - It can be computed by join operations between two lexicographically sorted files.
• Intermediate files can be large: round the small probabilities to zero at each step [Spielman & Teng 2004; Sarlós et al. 2006]. This has bounded error, but brings the file size down from O(n²) to O(|E|).

Outline
• Introduction to some measures
• High-degree nodes
• Disk-resident large graphs
• Results
  - Turning high-degree nodes into sinks
  - Deterministic algorithm vs. sampling
  - RWDISK on external-memory graphs: yields better clusters than METIS with a much smaller memory requirement
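The bound-based search can be sketched with a simplified upper bound: since PPV(i,·) sums to 1, any node outside the loaded clusters has PPV at most 1 minus the sum of the inside lower bounds. This is my own simplified stand-in for the talk's NB_j bounds; the cluster layout, β, and helper names are illustrative, not from the paper:

```python
def ppv_inside(graph, inside, i, alpha=0.15, n_iters=100):
    """Lower-bound PPV(i, .) using only edges inside the loaded node set.
    Mass that crosses to an outside node is dropped from the walk but
    tallied in `boundary`, so the inside scores are valid lower bounds."""
    cur, ppv, boundary, w = {i: 1.0}, {}, {}, alpha
    for _ in range(n_iters):
        for v, m in cur.items():
            ppv[v] = ppv.get(v, 0.0) + w * m
        nxt = {}
        for v, m in cur.items():
            deg = len(graph[v])
            for u in graph[v]:
                if u in inside:
                    nxt[u] = nxt.get(u, 0.0) + m / deg
                else:
                    boundary[u] = boundary.get(u, 0.0) + m / deg
        cur, w = nxt, w * (1 - alpha)
    return ppv, boundary

def expand_and_rank(graph, clusters, node_cluster, i, beta=0.05, alpha=0.15):
    """Load clusters until every unseen node's PPV upper bound is <= beta,
    then return the inside nodes ranked by their PPV lower bounds."""
    loaded = {node_cluster[i]}
    while True:
        inside = set().union(*(clusters[c] for c in loaded))
        lb, boundary = ppv_inside(graph, inside, i, alpha)
        ub_outside = 1.0 - sum(lb.values())   # PPV(i, .) sums to 1
        if ub_outside <= beta or not boundary:
            return sorted(lb.items(), key=lambda kv: -kv[1]), len(loaded)
        # "page fault": load the cluster receiving the most escaped mass
        best = max(boundary, key=boundary.get)
        loaded.add(node_cluster[best])
```

With a loose β the query is answered from a single in-memory cluster; tightening β forces the neighboring cluster to be paged in.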
(Will skip some details for now.)

Datasets
• Citeseer subgraph: co-authorship graph
• DBLP: paper-word-author graph
• LiveJournal: online friendship network

Effect of sinks on link-prediction accuracy and page faults:

  Dataset      Min. degree of sink nodes   Accuracy   Page-faults
  Citeseer     None                        0.74       69
  Citeseer     100                         0.74       67
  DBLP         None                        0.1        1881
  DBLP         1000                        0.58       231
  LiveJournal  None                        0.2        1502
  LiveJournal  100                         0.43       255

With sinks, accuracy is about 6 times better on DBLP and 2 times better on LiveJournal, with about 8 times and 6 times fewer page faults respectively.

Page faults for the deterministic algorithm:

  Dataset      Mean page-faults
  Citeseer     6
  DBLP         54
  LiveJournal  64

That is roughly 10 times fewer than sampling on Citeseer, and 4 times fewer on DBLP and LiveJournal.

Conclusions
• Turning high-degree nodes into sinks
  - has a bounded effect on personalized PageRank and hitting time;
  - significantly improves the running time of RWDISK (3–4 times);
  - reduces the number of page faults when sampling a random walk;
  - improves link-prediction accuracy.
• Search algorithms on a clustered framework
  - Sampling is easy to implement and can be applied widely.
  - The deterministic algorithm is guaranteed not to miss a potential nearest neighbor, and needs significantly fewer page faults than sampling.
• RWDISK: a fully external-memory algorithm for clustering a graph.

Thanks!

Backup slides

• Personalized PageRank
  - Start at node i.
  - At any step, reset to node i with probability α.
  - The stationary distribution of this process is

        PPV(i,j) = α Σ_t (1−α)^t P^t(i,j),   where P^t(i,j) = Pr(X_t = j | X_0 = i)

• Variant of the expansion algorithm:
  - Maintain ub(NB_j) for all nodes outside NB_j.
  - Stop when ub ≤ α: guaranteed to return all nodes with PPV(i,j) ≥ α.
  - Many fewer page faults than sampling! We can also compute PPV to node j using this algorithm.

RWDISK running time with and without sinks:

  Dataset      Min. degree of a sink node   Number of sinks   Time
  DBLP         None                         0                 ≥ 2.5 days
  DBLP         1000                         900               11 hours
  LiveJournal  1000                         950               60 hours
  LiveJournal  100                          134K              17 hours

Making high-degree nodes sinks speeds RWDISK up by roughly 3–4 times.

[Figure: histograms of the expected number of page faults when a random walk steps outside a cluster, for a randomly picked node; left to right, the panels are for Citeseer, DBLP, and LiveJournal.]
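The rounding trick behind RWDISK can be sketched in memory (the joins over lexicographically sorted disk files are elided; eps and the toy cycle graph are mine): dropping entries below eps after each power iteration keeps every intermediate vector sparse, and since dropped mass can only reduce later contributions, the truncated scores are pointwise lower bounds on the exact PPV.

```python
def ppv_truncated(graph, i, alpha=0.15, eps=0.0, n_iters=50):
    """PPV(i, .) = alpha * sum_t (1-alpha)^t P^t(i, .), truncated after
    n_iters, rounding probabilities below eps to zero after each step."""
    cur, ppv, w = {i: 1.0}, {}, alpha
    for _ in range(n_iters):
        for v, m in cur.items():
            ppv[v] = ppv.get(v, 0.0) + w * m
        nxt = {}
        for v, m in cur.items():
            for u in graph[v]:
                nxt[u] = nxt.get(u, 0.0) + m / len(graph[v])
        # Rounding step: the total dropped mass bounds the approximation error.
        cur = {v: m for v, m in nxt.items() if m >= eps}
        w *= 1 - alpha
    return ppv
```

With eps = 0 this is the exact (truncated-series) power iteration; with a positive eps each intermediate vector holds only the entries above the threshold, mirroring the O(n²) to O(|E|) file-size reduction.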