Data Stream Algorithms
Ke Yi
Hong Kong University of Science and Technology
Problem One: Missing Card

I take one card from a deck of 52 and pass the rest to you. Suppose you have only a (very basic) calculator and a bad memory. How can you find out which card is missing with just one pass over the 51 cards?

What if there are two missing cards?
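Both questions can be answered with running sums. A minimal sketch, assuming the cards are read as the numbers 1–52 (this encoding is an illustration, not from the slides):

```python
def missing_one(seen):
    # One missing card: the sum 1 + 2 + ... + 52 is fixed, so the
    # missing card is that fixed total minus the running sum we kept.
    return 52 * 53 // 2 - sum(seen)

def missing_two(seen):
    # Two missing cards x, y: in one pass keep a running sum and a
    # running sum of squares, then solve x + y = s1, x^2 + y^2 = s2.
    total, sq = 0, 0
    for c in seen:
        total += c
        sq += c * c
    s1 = 52 * 53 // 2 - total
    s2 = sum(c * c for c in range(1, 53)) - sq
    xy = (s1 * s1 - s2) // 2                   # from (x + y)^2 = s2 + 2xy
    disc = round((s1 * s1 - 4 * xy) ** 0.5)    # x, y are roots of t^2 - s1*t + xy = 0
    return (s1 - disc) // 2, (s1 + disc) // 2
```

This matches the constraint in the problem: only O(1) state (one or two running sums) is kept while scanning the 51 cards once.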
A data stream algorithm …
- Makes one pass over the input data
- Uses a small amount of memory (much smaller than the input data)
- Computes something
Why do we need streaming algorithms?

Networking
- Often get to see the data only once
- Don't want to store the entire data

Databases
- Data is stored on disk; sequential scans are much faster than random access

Data stream algorithms have been a very active research area for the past 15 years.
Problems considered today
- Missing card
- Majority
- Heavy hitters
Problem Two: Majority

Given a sequence of items, find the majority item if there is one.

A A B C D B A A B B A A A A A A C C C D A B A A A
Answer: A

- Trivial if we have O(n) memory
- Can you do it with O(1) memory and two passes?
  - First pass: find the only possible candidate
  - Second pass: compute its frequency and verify that it is > n/2
- How about one pass?
  - Unfortunately, no
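A sketch of the two-pass solution above, assuming the stream can be replayed (here it is simply a Python list). The first pass is the classic Boyer–Moore voting scheme: keep one candidate and a counter, cancelling the candidate against each differing item.

```python
def majority(stream):
    # Pass 1: find the only possible candidate using O(1) memory.
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            count -= 1
    # Pass 2: verify the candidate really occurs more than n/2 times.
    n = len(stream)
    if candidate is not None and stream.count(candidate) > n / 2:
        return candidate
    return None
```

If no majority exists, the first pass may still end with some candidate, which is why the second verification pass is essential.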
Problem Three: Heavy Hitters

Problem: find all items with counts > φn, for some 0 < φ < 1

Relaxation:
- If an item has count > φn, it must be reported, together with its estimated count with (absolute) error < εn
- If an item has count < (φ − ε)n, it cannot be reported
- For items in between, we don't care
- In fact, we will solve the most difficult case φ = ε

Applications:
- Frequent IP addresses
- Data mining
The algorithm
[Metwally, Agrawal, Abbadi, 2006]

Input parameters: number of counters m, stream S
Algorithm:
    for each element e in S {
        if e is monitored {
            find the counter of e, counter_i;
            counter_i++;
        } else {
            find the monitored element with the least counter, e_m; denote its counter value by min;
            replace e_m with e;
            set the counter of e to min + 1;
        }
    }
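A minimal Python sketch of the algorithm above (known as Space-Saving). The slide's pseudocode implicitly starts with all m counters in place; starting from an empty table and filling free counters first is equivalent. A real implementation would use a heap or the paper's stream-summary structure to find the minimum quickly; the linear scan here is just for clarity.

```python
def space_saving(stream, m):
    counters = {}                              # monitored item -> counter
    for e in stream:
        if e in counters:                      # e is monitored
            counters[e] += 1
        elif len(counters) < m:                # a free counter remains
            counters[e] = 1
        else:                                  # evict the minimum item
            e_min = min(counters, key=counters.get)
            mn = counters.pop(e_min)
            counters[e] = mn + 1               # overestimates e by at most mn
    return counters
```

Every counter overestimates its item's true count by at most min ≤ n/m, so with m = 1/ε counters the absolute error is at most εn.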
Properties of the Algorithm

- Actual count of a monitored item ≤ its counter
- Actual count of a monitored item ≥ its counter − min
- Actual count of an item not monitored ≤ min
  - Proof by induction
- The sum of the counters maintained is n
  - Why?
  - So min ≤ n/m
- If we set m = 1/ε, it is sufficient to solve the heavy hitter problem
  - Why?
  - So the heavy hitter problem can be solved in O(1/ε) space
How about deletions?

- Any deterministic algorithm needs Ω(n) space
  - Why?
  - In fact, Las Vegas randomization doesn't help
- We will design a randomized algorithm that works with high probability
  - For any item x, we can estimate its actual count within error εn with probability 1 − δ, for any small constant δ
The Count-Min sketch
[Cormode, Muthukrishnan, 2003]

A Count-Min (CM) sketch with parameters (ε, δ) is represented by a two-dimensional array count with width w and depth d: count[1,1], …, count[d,w]. Given parameters (ε, δ), set w = 2/ε and d = log₂(1/δ). Each entry of the array is initially zero.

The d hash functions

h1, …, hd : {1, …, n} → {1, …, w}

are chosen uniformly at random from a 2-universal family. For example, we can choose a prime number p > u (where u is the size of the item universe) and random aj, bj for j = 1, …, d, and define:

hj(x) = ((aj · x + bj) mod p) mod w

Property: for any x ≠ y, Pr[hj(x) = hj(y)] ≤ 1/w
Updating the sketch

Update procedure: when item x arrives, for each j with 1 ≤ j ≤ d, set

count[j, hj(x)] ← count[j, hj(x)] + 1

[Figure: x is fed to each of the hash functions h1, …, hd; in each row j, the single entry count[j, hj(x)] is incremented by 1.]

When item x is deleted, do the same except changing +1 to −1.
Estimating the count of x

The estimate is the smallest counter that x hashes to:

âx = min_j count[j, hj(x)]

where ax denotes the actual count of x and âx the estimated count.

Theorem 1: ax ≤ âx, and Pr[âx > ax + εn] ≤ δ.
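The update and estimation procedures above can be sketched as a small Python class. The item universe is assumed to be non-negative integers; the Mersenne prime p = 2^61 − 1 and the use of Python's random module are illustrative choices, not from the slides.

```python
import math
import random

class CountMinSketch:
    def __init__(self, eps, delta):
        self.w = math.ceil(2 / eps)                  # width  w = 2/eps
        self.d = math.ceil(math.log2(1 / delta))     # depth  d = log2(1/delta)
        self.p = 2**61 - 1                           # prime p > universe size
        # One (a_j, b_j) pair per row defines h_j from a 2-universal family.
        self.ab = [(random.randrange(1, self.p), random.randrange(self.p))
                   for _ in range(self.d)]
        self.count = [[0] * self.w for _ in range(self.d)]

    def _h(self, j, x):
        a, b = self.ab[j]
        return ((a * x + b) % self.p) % self.w       # h_j(x)

    def update(self, x, c=1):
        # Arrival: add 1 to one entry per row; a deletion passes c = -1.
        for j in range(self.d):
            self.count[j][self._h(j, x)] += c

    def estimate(self, x):
        # Estimated count: min over rows. It never underestimates, and
        # exceeds the actual count by more than eps*n with probability <= delta.
        return min(self.count[j][self._h(j, x)] for j in range(self.d))
```

Note that a deletion subtracts 1 from exactly the same d entries an arrival incremented, so the sketch supports the turnstile setting that defeats deterministic algorithms.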
Proof

We introduce indicator variables

Ix,y,j = 1 if (x ≠ y) and (hj(x) = hj(y)), and 0 otherwise

E(Ix,y,j) = Pr[hj(x) = hj(y)] ≤ 1/w = ε/2

Define the variable

Ix,j = Σy Ix,y,j · ay

By construction, count[j, hj(x)] = ax + Ix,j, so âx = min_j count[j, hj(x)] ≥ ax.

For the other direction, observe that

E(Ix,j) = E(Σy Ix,y,j · ay) = Σy ay · E(Ix,y,j) ≤ εn/2

Therefore

Pr[âx > ax + εn] = Pr[∀j, count[j, hj(x)] > ax + εn]
                 = Pr[∀j, ax + Ix,j > ax + εn]
                 ≤ Pr[∀j, Ix,j > 2·E(Ix,j)]      (since εn ≥ 2·E(Ix,j))
                 ≤ 2^(−d) = δ

using the Markov inequality (Pr[X > t] ≤ E(X)/t for any t > 0) in each row, and the independence of the d hash functions.

So, the Count-Min sketch has size O((1/ε) · log(1/δ)). ■