Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Stochastic Streams: Sample Complexity vs. Space Complexity David Woodruff IBM Almaden Joint work with Michael Crouch, Andrew McGregor, and Greg Valiant Motivation • (Well-studied) Statistics question: how many samples from a distribution are needed to estimate a property of a distribution? • (Well-studied) Streaming question: for a given fixed stream of samples, how much space is needed to estimate a property of a distribution? • Our work: understand the tradeoff between the sample and space complexity Model 4 3 7 3 1 1 2 … • Algorithm sees a stream of i.i.d. samples from a distribution • Algorithm only has 1 pass over the samples • Goal: understand the tradeoff between the number t of samples needed to solve a problem, versus the space s of the algorithm Problems • (Statistics) Given t samples from a distribution p = p1 , … , pn on n items, estimate the collision probability i p2i up to a 1 + ϵ - factor • (Graph Problems) Given t independent edges chosen with replacement from a graph G, decide if G is connected • (Linear Algebra) Given t independent samples from a subspace S of GF 2 d , determine if S has dimension d/2 or dimension d Talk Outline • Sample/Space Tradeoff for Collision Probability Estimation • Sample/Space Tradeoff for Deciding Connectivity • Sample/Space Tradeoff for Determining if a Subspace is Full Rank Collision Probability • Given t samples from distribution p = p1 , … , pn on n items, estimate the collision probability i p2i up to a 1 + ϵ - factor • Collision probability is called F2 • If t = o(n.5 ), impossible with any amount of space • Distinguish p = 1 1 ,…, n n 2 n 2 n vs. p = ( , … , , 0, … , 0) • Our algorithm n t • For any t = Ω𝜖 (n.5 ), can 1 + ϵ -approximate F2 with t samples and Oϵ (1 + ) space Collision Probability Algorithm • Break the t samples into t/w contiguous groups of w samples 4 … 3 7 … Group 2 Group 1 3 1 … 1 … Group 3 • For each group of samples a1 , … , aw let Xi,j = 1 if ai = aj , and let 1 X= i≠j X i,j be the probability of a collision on the group w(w−1) • Use w log n bits of space to compute an estimate X for a group, and average estimates over t/w groups Collision Probability Algorithm • E X = E Xi,j = 2 p k k = F2 F2 w w−1 n.5 F2 Θ( ) w • Var X ≤ + • Chebyshev’s inequality implies a 1 + ϵ -approximation to F2 with n n.5 error probability O 2 + 2 wt ϵ • Set t > n.5 , and 𝜖2 w=O tϵ n tϵ2 Collision Probability Lower Bound • Use lower bound for random order streams • Case 1: see a stream a1 , … , at of t < n distinct items from a universe U • Case 2: see a stream a1 , … , at of t-r distinct items together with an item i which occurs r times, all from universe U • Order of streams is random t • [AOMP, GH] any streaming algorithm needs Ω 2.5 space to distinguish r t the cases, even with an infinite random tape (conjectured space: Ω 2 ) r Collision Probability Lower Bound • Choose a random function h: U -> [n] • Given a stream a1 , … , at of items in a random order, feed the algorithm for IID streams the stream h(a1 ), … , h(at ) a1 a2 a3 a4 a5 … at h(a1 ) h(a2 ) h(a3 ) h(a4 ) h(a5 ) … h(at ) Collision Probability Lower Bound • If a1 , … , at are distinct, obtain IID samples from distribution 1 1 ( ,…, ) n n • If a1 , … , at are distinct together with an item i occurring r times, roughly see IID samples from distribution 1 r 1 r r 1 r 1 r − , …to , F − , , − ,…, − ) Question: (Extend n nt n nt t n nt n nt k • If r > t/n.5 , then F2 in two cases differs by a constant factor 5 n4 • Implies w = Ω( 1.5) (Conjectured = t n Ω( )) t Talk Outline • Sample/Space Tradeoff for Collision Probability Estimation • Sample/Space Tradeoff for Deciding Connectivity • Sample/Space Tradeoff for Determining if a Subspace is Full Rank Graph Connectivity • Given t independent edges chosen with replacement from graph G, decide if G is connected • Simulate a random walk starting at node 1 • Store current vertex • If see an edge not incident to the current vertex, discard it • Remember first node i which you haven’t seen. Finish when i > n Graph Connectivity 2 Start at vertex 1 Current Vertex: 1 3 1 First Untouched Vertex: done 23 4 4 See IID Stream: {1, 4}, {2, 3}, {1, 4}, {3,4}, {1,2}, {2, 3}, {1,2}, {3,4} do nothing do nothing do nothing O(log n) space, and O m (mn2 ) = O m2 n2 samples The Loopy Graph • For each vertex v in G, add m − dv self-loops to vertex v • The resulting “loopy graph” H is m-regular and has mn edges • Perform previous algorithm on H • Each sample is important! • O(log n) space, but only O mn n2 = O mn3 samples Use More Space and Fewer Samples • We have an algorithm with O(log n) space and O mn3 samples How can we use more space but fewer samples? • Take p random walks in loopy graph, and remember current vertex of each one • O(p log n) space • Use each sample to update all p random walks • Issue: random walks not independent • Fix: can simulate independence since most walks don’t move on a given sample What to do with p independent random walks? Space/Time Tradeoff for Connectivity [Feige] • For any k, the following is a correct algorithm whp: • (Phase 1) Ensure graph has no connected components of size ≤ k • (Phase 2) Sample n log n k x vertices and verify the samples are connected x If there is asuppose component Otherwise, we with ≤ k vertices, declare are in phase 2 graph is disconnected x x Will sample a vertex from each group of k vertices Implementation in the IID Model • For any p ≤ n, can implement algorithm with O(p) space and mn2 O( 2 ) p samples • Set k = O(n/p) • Even for p = O(1), this improves our earlier O(log n) space and O mn3 samples • Phase 1: sample O(p) nodes at random, run random walk on each of them estimate in O(1) space the number of distinct nodes visited in each walk • Phase 2: sample O(p) nodes at random, run random walk on each of them keep track of which sampled nodes are connected Talk Outline • Sample/Space Tradeoff for Collision Probability Estimation • Sample/Space Tradeoff for Deciding Connectivity • Sample/Space Tradeoff for Determining if a Subspace is Full Rank Determining if a Subspace Has Full Rank • Given t IID samples from a subspace S of GF 2 d , is rank(S) = d/2 or rank(S) = d? • O(log d) space algorithm: choose a random standard unit vector ei and check if any sample equals ei • After O 2d samples, if S has full rank, will see ei • If S has rank d/2, with probability ½, will never see ei • Our Lower Bound: Any algorithm succeeding with probability > 2/3 and using o(d) space must use 2Ω d samples Statistical Query Framework • A “statistical query” algorithm (s.q. algorithm) is adaptively proposes functions f1 , f2 , … with fi : 0,1 d → −1, 1 and receives estimates of Ex fi x that are corrupted via adversarial noise • An s-query s.q. algorithm with tolerance τ is an algorithm that: • for every rank d/2 subspace S, after querying f1 , … , fs with responses r1 , … , rs with ri − Ex uniform in S [fi x ] ≤ τ, it outputs “rank d/2” with probability 3/4 • if the responses r1 , … , rs satisfy ri − Ex uniform in 0,1 d [fi (x) ≤ τ, the algorithm outputs “full rank” with probability 3/4 • Main Lemma: For any fi , d Pr Ex uniform on S 0,1 d fi x − Ex uniform on S fi x −8 >2 d −4 ≤2 Statistical Query Framework Each of t players in a communication game receives a sample from GF 2d P1 P2 P3 Pt … Each message depends on all previous messages and is at most s bits [SVG] If there is a communication protocol with probability 1-δ, then there is an δ O(st) query s.q. algorithm with probability 1-2δ with tolerance s t2 Set s = o d and t = 2Ω(d) and apply our main lemma Conclusions • Studied space versus sample tradeoffs in the data stream model • Obtained tradeoffs for statistical, graph, and linear algebra problems • Open questions: tighten our bounds • General question: unify the techniques for the different problems