Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology Problem One: Missing Card I take one from a deck of 52 cards, and pass the rest to you. Suppose you only have a (very basic) calculator and bad memory, how can you find out which card is missing with just one pass over the 51 cards? What if there are two missing cards? 2 A data stream algorithm … 3 Makes one pass over the input data Uses a small amount of memory (much smaller than the input data) Computes something Why do we need streaming algorithms Networking Often get to see the data once Don’t want to store the entire data Databases Data stored on disk, sequential scans are much faster Data stream algorithms have been a very active research area for the past 15 years Problems considered today 4 Missing card Majority Heavy hitters Problem two: Majority Given a sequence of items, find the majority if there is one AA B C D B AA B B AAAAAA C C C D A B AAA Answer: A Trivial if we have O(n) memory Can you do it with O(1) memory and two passes? First pass: find the possible candidate Second pass: compute its frequency and verify that it is > n/2 How about one pass? 5 Unfortunately, no Problem three: Heavy hitters Problem: find all items with counts > φn, for some 0< φ<n Relaxation: If an item has count > φ n, it must be reported, together with its estimated count with (absolute) error < εn If an item has count < (φ − ε) n, it cannot be reported For items in between, don’t care In fact, we will solve the most difficult case φ = ε Applications 6 Frequent IP addresses Data mining The algorithm [Metwally, Agrawal, Abbadi, 2006] Input parameters: number of counters m, stream S Algorithm: for each element e in S { if e is monitored { find counter of e, counteri; counteri++; } else { find the element with least frequency, em, denote its frequency min; replace em with e; assign counter for e with min+1; } } 7 Properties of the Algorithm Actual count of a monitored item ≤ counter Actual count of a monitored item ≥ counter – min Actual count of an item not monitored ≤ min Proof by induction The sum of the counters maintained is n Why? So min <= n/m If we set m = 1/ε, it’s sufficient to solve the heavy hitter problem 8 Why? So the heavy hitter problem can be solved in O(1/ε) space How about deletions? Any deterministic algorithm needs Ω(n) space Why? In fact, Las Vegas randomization doesn’t help Will design a randomized algorithm that works with high probability 9 For any item x, we can estimate its actual count within error εn with probability 1-δ for any small constant δ The Count-Min sketch [Cormode, Muthukrishnan, 2003] A Count-Min (CM) Sketch with parameters ( , ) is represented by a two-dimensional array counts with width w and depth d : count[1,1] count[d , w] Given parameters 2 1 ( , ) , set w and d log . Each entry of the array is initially zero. d hash functions are chosen uniformly at random from a 2-univeral family For example, we can choose a prime number p > u, and random aj, bj, j=1,…,d. Define: h1 ,, hd : {1n} {1 w} hj(x) = (aj x + bj mod p) mod w Property: for any x ≠ y, Pr[hj(x)=hj(y)] ≤ 1/w 10 Updating the sketch Update procedure : When item x arrives, set 1 j d count[ j , h j ( x)] count[ j , h j ( x)] 1 h1 x hd 1 1 1 1 When item x is deleted, do the same except changing +1 to -1 11 Estimating the count of x aˆ x min count[ j, h j ( x)] Q (x ) j actual count Theorem 1 estimated count ax aˆ x Pr[aˆ x ax n] 12 Proof We introduce indicator variables I x, y, j ( x y ) (h j ( x) h j ( y )) 1 if 0 otherwise 1 E ( I x , y , j ) Pr[ h j ( x) h j ( y )] w 2 Define the variable I x, j I x, y , j a y y By construction, 13 count[ j , h j ( x)] a x I x , j min count[ j , h j (i)] ai For the other direction, observe that E ( I x , j ) E I x , y , j a y a y E ( I x , y , j ) n / 2 y y Pr[ aˆ x a x n] Pr[j , count[ j , h j ( x)] a x n] Pr[j , a x I x , j a x n] Pr[j, I x, j 2E( I x, j )] 2d Markov inequality Pr[ X t ] E( X ) t t 0 1 1 So, the Count-Min Sketch has size O log 14 ■