Range-Efficient Counting of Distinct
Elements
Srikanta Tirthapura
Iowa State University
(joint work with Phillip Gibbons, Aduri Pavan)
Range-Efficient F0
Stream:
[100,200], [0,10], [60, 120], [5,25]
[Figure: number line with the range endpoints 0, 5, 10, 25, 60, 100, 120, 200 marked]
F0:
|[0,25] U [60,200]| = 167
5/24/2017
IIT Kanpur Streams Workshop
Range-Efficient F0
Input Stream:
Sequence of ranges [l1,r1], [l2,r2] … [lm,rm]
for each i, 0 <= li <= ri <= n, and li, ri are integers
Output:
Return | [l1,r1] U [l2,r2] U … U [lm,rm]|
i.e. number of distinct elements in the union (F0)
Constraints:
• Single pass through the data
• Small Workspace
• Fast Processing Time
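As a reference point for the problem statement above, here is a minimal offline sketch that computes the exact answer by sorting the ranges. It stores every range and is therefore not a single-pass, small-workspace algorithm; the function name `union_size` is an illustrative choice, not something from the talk.

```python
def union_size(ranges):
    """Exact number of distinct integers covered by closed ranges [l, r]."""
    total = 0
    covered_up_to = None  # right end of the portion already counted
    for l, r in sorted(ranges):
        if covered_up_to is None or l > covered_up_to:
            # Range starts beyond everything counted so far.
            total += r - l + 1
            covered_up_to = r
        elif r > covered_up_to:
            # Range overlaps the counted portion; add only the new part.
            total += r - covered_up_to
            covered_up_to = r
    return total


# The example stream from the earlier slide.
print(union_size([(100, 200), (0, 10), (60, 120), (5, 25)]))  # 167
```

Such an exact baseline is mainly useful for checking the output of an approximate streaming estimator on small inputs.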
Reductions to Range-Efficient F0
[Diagram: three problems that reduce to Range-Efficient F0]
• Duplicate-Insensitive Sum
• Max-Dominance Norm
• Counting Triangles in Graphs
Duplicate-Insensitive Sum
Problem: Sum of all distinct elements in a stream of integers
Input Stream:
Sequence of integers S = a1,a2,….., an
Output:
Σ_{distinct a_i in S} a_i
Example:
S = 4, 5, 15, 4, 100, 4, 16, 15
Distinct Elements = 4,5,15,100, 16
Sum = 140
Reduction from Dup-Insensitive Sum to F0
Stream S from U = [0,m-1]  →  Alternate Stream S’ from U’ = [0,m^2-1]
Each element a of S becomes the range [a·m, …, a·m+a-1] in S’:
  4    →  [4m, …, 4m+3]
  5    →  [5m, …, 5m+4]
  15   →  [15m, …, 15m+14]
  4    →  [4m, …, 4m+3]
  100  →  [100m, …, 100m+99]
  4    →  [4m, …, 4m+3]
  15   →  [15m, …, 15m+14]
Duplicate-Insensitive Sum of S = Number of Distinct Elements of S’
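The reduction can be checked on a small input. The sketch below maps each element a to the range [a·m, a·m + a - 1], as in the slide's examples, and then counts the distinct integers covered. It materialises the union in a set purely for illustration; a streaming solution would hand each range to a range-efficient F0 estimator instead. The names `to_range` and `dup_insensitive_sum_via_f0`, and the choice m = 101, are assumptions made for this example.

```python
def to_range(a, m):
    """Map element a (1 <= a <= m - 1) to the closed range [a*m, a*m + a - 1]."""
    return (a * m, a * m + a - 1)


def dup_insensitive_sum_via_f0(stream, m):
    """F0 of the range stream, computed exactly (illustration only)."""
    covered = set()
    for a in stream:
        lo, hi = to_range(a, m)
        covered.update(range(lo, hi + 1))  # contributes exactly a integers
    return len(covered)


stream = [4, 5, 15, 4, 100, 4, 16, 15]  # example from the earlier slide
m = 101                                 # assumed universe size, elements < m
print(dup_insensitive_sum_via_f0(stream, m))  # 140
print(sum(set(stream)))                       # 140
```

Because a ≤ m - 1, the ranges of two different elements never overlap, which is why the distinct count equals the sum of the distinct elements.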
Max Dominance Norm
Given k streams of m integers each (the elements of the streams arrive in an
arbitrary order), where 1 ≤ a_{i,j} ≤ n:
a_{1,1} a_{1,2} … a_{1,m}
a_{2,1} a_{2,2} … a_{2,m}
…
a_{k,1} a_{k,2} … a_{k,m}
Return Σ_{j=1}^{m} max_{1 ≤ i ≤ k} a_{i,j}
Reduction From Max Dominance Norm
• Input stream I, output stream O:
  F0 of Output Stream = Dominance Norm of Input Stream
• Assign ranges to the m positions:
  [1,n] [n+1,2n] … [(m-1)n+1, mn]
• When element a_{i,j} is received, generate the range
  [(j-1)n+1, (j-1)n+a_{i,j}]
• Observation: F0 of the resulting stream of ranges is the dominance norm of the
  input stream
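A small check of this reduction, under the assumption that position j owns the block [(j-1)n+1, jn] and that an element with value a at position j is mapped to the first a integers of that block. The helper names and the toy input are made up for illustration, and the union is again computed exactly with a set rather than by a streaming estimator.

```python
def dominance_ranges(items, n):
    """items: iterable of (i, j, value), 1-based position j, 1 <= value <= n."""
    for _i, j, value in items:
        start = (j - 1) * n + 1           # first integer of position j's block
        yield (start, start + value - 1)  # exactly `value` integers


def f0_of_ranges(ranges):
    """Exact distinct count of a stream of closed ranges (illustration only)."""
    covered = set()
    for lo, hi in ranges:
        covered.update(range(lo, hi + 1))
    return len(covered)


streams = [[3, 1, 4], [2, 5, 1]]  # k = 2 streams, m = 3 positions, n = 5
items = [(i, j, v)
         for i, row in enumerate(streams, 1)
         for j, v in enumerate(row, 1)]

norm = sum(max(column) for column in zip(*streams))
print(norm)                                      # 12
print(f0_of_ranges(dominance_ranges(items, 5)))  # 12
```

Within one block all ranges share the same left end, so their union has as many integers as the largest value seen at that position.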
Talk Outline
• Range Efficient F0
– Reductions Among Data Stream Problems
• Algorithm for Range Efficient F0 (building
on distinct sampling)
• Update Streams
• Open Questions
Counting Distinct Elements (F0)
• Example
– How many different users accessed my website today?
– Stream = 1,1,2,3,4,1,2
F0 = 4
• Numerous Applications in databases and networking
• Prior Work
– Flajolet-Martin (1985)
– Alon, Matias and Szegedy (1996)
– Gibbons and Tirthapura (2001)
– Bar-Yossef et al. (2002) (currently most space-efficient)
– Indyk-Woodruff (2003) (Lower Bounds)
Range-Efficient F0
(Pavan and Tirthapura)
Distinct Sampling Algorithm for F0
+
Range Sampling for 2-way Independent Hash Functions
Sampling Based Algorithm for F0
(Gibbons and Tirthapura 2001)
U = {1,2,3,…,n} = S0
S0 → (p=1/2) → S1 = {2,4,7,…}
S1 → (p=1/2) → S2 = {4,7,11,…}
D = Distinct Elements In Stream
S0, S1, S2, … stored implicitly using hash functions
[Figure: nested sets S0 ⊇ S1 ⊇ S2 with the intersections D ∩ S1 and D ∩ S2 marked]
Distinct Sampling
Target Workspace = 4 numbers
Stream so far, followed by the state of the sample:
(empty): Sample = {}, p = 1
5: Sample = {5}, p = 1
5 3: Sample = {5,3}, p = 1
5 3 7: Sample = {5,3,7}, p = 1
5 3 7 5: Sample = {5,3,7}, p = 1
5 3 7 5 6: Sample = {5,3,7,6}, p = 1
5 3 7 5 6 8: Sample = {5,3,7,6,8}, p = 1. Overflow: Sample = Sample ∩ S1, giving Sample = {3,6,8}, p = ½
5 3 7 5 6 8 9: Sample = {3,6,8,9}, p = ½
5 3 7 5 6 8 9 7: Same decision for both occurrences of 7. Sample = {3,6,8,9}, p = ½
5 3 7 5 6 8 9 7 2: Sample = {3,6,8,9,2}, p = ½. Overflow: Sample = Sample ∩ S2, giving Sample = {6,9}, p = ¼
5 3 7 5 6 8 9 7 2 2 7 8 8 3 5: Finally, Sample = {6,9}, p = ¼
Estimate of F0 = (Sample Size)(4) = 8
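The walkthrough above can be turned into a short sketch. It assumes that level i of the sample keeps an element x when h(x) = (a·x + b) mod p falls below p / 2^i, so the levels are nested and each halves the sampling probability. The class name `DistinctSampler`, the default prime, and drawing a from [1, p-1] are choices made for this example, not details taken from the talk.

```python
import random


class DistinctSampler:
    """Keeps at most `capacity` distinct elements, halving p on overflow."""

    def __init__(self, capacity, prime=2_147_483_647, seed=None):
        rng = random.Random(seed)
        self.capacity = capacity
        self.prime = prime
        self.a = rng.randrange(1, prime)  # assumption: non-zero multiplier
        self.b = rng.randrange(prime)
        self.level = 0                    # current sampling probability is 2**-level
        self.sample = set()

    def _in_level(self, x, level):
        # x belongs to S_level when its hash lands in [0, prime / 2**level).
        return (self.a * x + self.b) % self.prime < (self.prime >> level)

    def add(self, x):
        if not self._in_level(x, self.level):
            return  # same decision for every occurrence of x
        self.sample.add(x)
        while len(self.sample) > self.capacity:
            # Overflow: move to the next level and keep only its members.
            self.level += 1
            self.sample = {y for y in self.sample
                           if self._in_level(y, self.level)}

    def estimate(self):
        return len(self.sample) * (1 << self.level)


sampler = DistinctSampler(capacity=4, seed=1)
for x in [5, 3, 7, 5, 6, 8, 9, 7, 2, 2, 7, 8, 8, 3, 5]:
    sampler.add(x)
print(sampler.sample, sampler.level, sampler.estimate())
```

The printed sample depends on the random hash function, so it will generally differ from the sets shown in the walkthrough. With a capacity at least as large as the number of distinct elements, the sampler never leaves level 0 and the estimate is exact.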
Counting Distinct Elements
• Finally, return a sample of distinct elements of
  the stream of a “large enough” size
• If target workspace = O((1/ε²)(log(1/δ)))
  integers, then estimate of F0 is an (ε, δ)-approximation
• Hash functions need only be pairwise
  independent and can be stored in small space
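For reference, the usual meaning of an (ε, δ)-approximation, written out with X denoting the estimate returned by the algorithm (the symbol X is introduced here only for this statement):

```latex
\Pr\bigl[\, |X - F_0| \le \varepsilon \, F_0 \,\bigr] \;\ge\; 1 - \delta ,
\qquad 0 < \varepsilon < 1,\quad 0 < \delta < 1 .
```

In words, the estimate is within relative error ε of the true F0 except with probability at most δ.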
Sampling Using Independent Coin Tosses vs. Distinct Sampling Using Hash Functions
[Figure: 0/1 sampling decisions produced by independent coin tosses compared with those produced by a hash function]
Adaptive Sampling for Range-Efficient F0
• Naïve Approach: Given range [x,y], successively
insert {x, x+1, … y} into F0 sampling algorithm
– Problem: Time per range very large
• Range-Sampling: Given stream element [p,q],
how to sample all elements in [p,q] quickly?
– At sampling level i, quickly compute |[p,q] ∩ Si|
Hash Functions, and S0,S1,S2…
h(x)=(ax+b) mod p
p prime
a,b random in [0,p-1]
If h(x) ∈ [0,vi], then x ∈ Si
[Figure: h maps the universe [1,n] into [0,p-1], with the thresholds v1, v2, v3 marked]
Range Sampling
f(x)=(ax+b) mod p
Compute |{x ∈ [x1,x2] : f(x) ∈ [0,v] }|
[Figure: f maps the range [x1,x2] within [1,n] into [0,p-1]; the target interval is [0,v]]
Arithmetic Progression
f(x)=(ax+b) mod p
Common Difference = a
[Figure: f(x1), f(x1+1), … marked on [0,p-1], together with the interval [0,v]]
Low and High Revolutions
• Each revolution, number of hits on [0,v] is
– floor(v/a) (low rev)
– floor(v/a) +1 (high rev)
• Task: Count number of low, high revolutions
[Figure: f(x1), f(x1+1) and the interval [0,v] marked on [0,p-1]]
Starting Points of Revolutions
• Can find r = v mod a such that:
– If starting point in [0,r], then high revolution
– Else low revolution
• Task: Count the number of revolutions with starting point in [0,r]
[Figure: r, v, f(x1) and f(x1+1) marked on [0,p-1]]
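The low/high classification can be checked by brute force on small parameters. The sketch assumes a revolution starts at a point in [0, a-1], advances in steps of a until it passes p-1, and that the threshold on the starting point is r = v mod a. The function names and the parameter values are illustrative only.

```python
def revolution_hits(a, p, v, start):
    """Count the points start, start + a, ... (all below p) that lie in [0, v]."""
    return sum(1 for y in range(start, p, a) if y <= v)


def classification_holds(a, p, v):
    """True if every starting point in [0, a-1] matches the low/high rule."""
    low = v // a
    r = v % a  # assumed threshold on the starting point
    for start in range(a):
        expected = low + 1 if start <= r else low
        if revolution_hits(a, p, v, start) != expected:
            return False
    return True


print(classification_holds(7, 101, 40))  # True
```

For a = 7 and v = 40 a low revolution has 5 hits and a high one has 6, and only the starting point 6 gives a low revolution.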
Recursive Algorithm
Observation: Starting Points form an Arithmetic Progression
with difference (- p mod a)
[Figure: the modulo p circle (0 to p-1) next to the modulo a circle (0 to a-1), with a and r marked]
Recursive Algorithm
• Focus on common difference
• Two Reductions Possible
– Common Difference a → Common Difference a - (p mod a)
– Common Difference a → Common Difference (p mod a)
At least one of the two common differences is at most a/2
Range Sampling
Theorem: There is an algorithm for sampling range
[x,y] using 2-way independent hash functions with
– Time complexity O(log (y-x))
– Space Complexity O(log (y-x) + log m)
Plug back into distinct sampling to get range-efficient F0 algorithm
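The recursive algorithm from the talk is not reproduced here. As a stand-in, the same count |{x ∈ [x1,x2] : (ax+b) mod p ∈ [0,v]}| can be obtained from a different, well-known Euclid-style recurrence for sums of floors, which also finishes in a number of iterations logarithmic in p. The names `floor_sum` and `count_hits` are choices made for this sketch.

```python
def floor_sum(n, m, a, b):
    """Sum of floor((a*i + b) / m) for i = 0 .. n-1 (n >= 0, m >= 1, a, b >= 0)."""
    total = 0
    while True:
        if a >= m:
            total += (n - 1) * n // 2 * (a // m)
            a %= m
        if b >= m:
            total += n * (b // m)
            b %= m
        y_max = a * n + b
        if y_max < m:
            return total
        # Swap the roles of slope and modulus, as in Euclid's algorithm.
        n, b = divmod(y_max, m)
        m, a = a, m


def count_hits(a, b, p, x1, x2, v):
    """Number of x in [x1, x2] with (a*x + b) mod p <= v, for 0 <= v < p."""
    n = x2 - x1 + 1
    if n <= 0:
        return 0
    a %= p
    start = (a * x1 + b) % p  # hash value of the first element of the range
    # A value t misses [0, v] exactly when adding p-1-v pushes it past
    # the next multiple of p.
    misses = floor_sum(n, p, a, start + p - 1 - v) - floor_sum(n, p, a, start)
    return n - misses


print(count_hits(37, 11, 101, 0, 500, 25))
```

In a range-efficient sampler, a call like this would decide how many elements of an incoming range belong to the current sampling level without touching each element individually.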
Results
Input Stream
Sequence of ranges [l1,r1], [l2,r2] … [lm,rm]
for each i, 0 <= li <= ri < n, and li, ri are integers
Output
| [l1,r1] U [l2,r2] U … U [lm,rm]|
• Randomized (ε,δ)-Approximation Algorithm for Range-efficient F0 of a data stream
• Processing Time (n is the size of the universe):
– Amortized processing time per interval: O(log(1/δ) (log (n/ε)))
– Time to answer a query for F0 is a constant
• WorkSpace: O((1/ε²)(log(1/δ)) (log n))
Pavan, Tirthapura, SICOMP (to appear)
Prior Work
• Bar-Yossef, Kumar, Sivakumar 2002
– First studied range-efficient F0
– Algorithms with higher space complexity
• Cormode, Muthukrishnan 2003
– Max-dominance Norm
• Nath, Gibbons, Seshan, Anderson 2004
– Duplicate-insensitive Sum assuming ideal hash
functions
Comparison
Range-Efficient F0
         Bar-Yossef et al.                  Pavan and Tirthapura
Time     O((log^5 n)(1/ε^5)(log 1/δ))       O((log n + log 1/ε)(log 1/δ))
Space    O((1/ε^3)(log n)(log 1/δ))         O((1/ε^2)(log n)(log 1/δ))

Max-Dominance Norm
         Cormode, Muthukrishnan                                        Pavan and Tirthapura
Time     O((1/ε^4)(log n)(log m)(log 1/δ))                             O((log n + log 1/ε)(log 1/δ))
Space    O((1/ε^2)(log n + (1/ε)(log m)(log log m))(log 1/δ))          O((1/ε^2)(log m + log n)(log 1/δ))
Other Applications of Distinct
Sampling
1. Sample of distinct elements of the stream of
any desired target size
2. Approximate median of all distinct elements
in stream (duplicate insensitive median)
3. Distinct Frequent elements (“heavy hitters” in
network monitoring)
Update Streams
• Insertions and Deletions of elements into the
streams
(11, +1), (7, +3), (4, +2), (7, -2), (11,-1)…
• Distinct Elements Problem: How many elements
have a positive cumulative weight?
– Assume a “sanity constraint”, no element has weight
less than 0
• Sampling algorithm described so far fails, since it
can only decrease sampling probability as stream
becomes larger
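To make the update-stream version of the problem concrete, here is a minimal exact computation that keeps one counter per element. It uses space proportional to the number of elements seen, so it only illustrates the definition and is not a small-space algorithm; the function name is an assumption of this sketch.

```python
from collections import Counter


def distinct_positive(updates):
    """Number of elements whose cumulative weight is positive."""
    weights = Counter()
    for x, delta in updates:
        weights[x] += delta
    return sum(1 for w in weights.values() if w > 0)


# The update stream from the slide.
updates = [(11, +1), (7, +3), (4, +2), (7, -2), (11, -1)]
print(distinct_positive(updates))  # 2
```

Element 11 is inserted and then fully deleted, so only 7 and 4 end with positive weight.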
Distinct Sampling on Update Streams
(three independent approaches)
• Sumit Ganguly, Minos N. Garofalakis, Rajeev Rastogi:
Processing Set Expressions over Continuous Update
Streams. SIGMOD 2003,
followed up by Ganguly, 2005 and Ganguly, Majumder
2006
• Graham Cormode, S. Muthukrishnan, Irina Rozenbaum:
Summarizing and Mining Inverse Distributions on Data
Streams via Dynamic Inverse Sampling. VLDB 2005
• Gereon Frahling, Piotr Indyk, Christian Sohler: Sampling
in dynamic data streams and applications. SoCG 2005
Distinct Elements on Update Streams
Use of K-Set Structure in storing samples
Ganguly, Garofalakis, Rastogi 2003
Ganguly 2005
Ganguly, Majumder 2006
K-Set Structure
• Small space data structure for multi-set S (size Õ(K))
• Operations
– Insert (x,v) into S
– Delete (x,v’) from S
– Membership Query (is x in S?)
– What is the number of distinct elements in S?
• If |S| ≤ K, then Queries answered correctly
[Figure: the size of S against the threshold K over time, with phases labeled Active, Silent, Active]
Counting Distinct Elements on Update
Streams
1. Sample Stream at different probabilities, 1, ½,
¼,…..
2. Store each of (D ∩ S0, D ∩ S1, D ∩ S2,…..) in a
k-set structure for an appropriate value of k
3. When queried, use the highest probability
sample that hasn’t overflowed yet
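The three steps can be simulated as follows. This is only a sketch of the control flow: each level uses an exact dictionary as a stand-in for a k-set structure, so it does not have the small-space guarantee, and it simply refuses to answer while it holds more than k elements. The class names, the number of levels, and the hash parameters are all assumptions of this example, and the stream is assumed to satisfy the sanity constraint.

```python
import random
from collections import defaultdict


class LevelStore:
    """Exact stand-in for a k-set structure: answers only while size <= k."""

    def __init__(self, k):
        self.k = k
        self.weights = defaultdict(int)

    def update(self, x, delta):
        self.weights[x] += delta
        if self.weights[x] == 0:
            del self.weights[x]  # fully deleted elements leave the store

    def distinct(self):
        size = len(self.weights)
        return size if size <= self.k else None  # None means overflowed


class UpdateStreamF0:
    def __init__(self, k, levels=32, prime=2_147_483_647, seed=None):
        rng = random.Random(seed)
        self.prime = prime
        self.a = rng.randrange(1, prime)
        self.b = rng.randrange(prime)
        self.stores = [LevelStore(k) for _ in range(levels)]

    def update(self, x, delta):
        h = (self.a * x + self.b) % self.prime
        for i, store in enumerate(self.stores):
            if h >= (self.prime >> i):
                break  # levels are nested, so x is in no deeper level
            store.update(x, delta)

    def estimate(self):
        # Use the highest-probability level that has not overflowed.
        for i, store in enumerate(self.stores):
            d = store.distinct()
            if d is not None:
                return d * (1 << i)
        return None


est = UpdateStreamF0(k=4, seed=7)
for x, delta in [(11, +1), (7, +3), (4, +2), (7, -2), (11, -1)]:
    est.update(x, delta)
print(est.estimate())  # 2
```

With k = 4 the probability-1 level never overflows on this short stream, so the answer is exact; on longer streams the estimate comes from a deeper level and is scaled by the inverse sampling probability.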
Distributed Streams
Alice (Workspace = $$) observes Stream A: 11 54 21 11 2 45 21 1…
and sends Sketch(A) to the Referee
Bob (Workspace = $$) observes Stream B: 1 5 21 2 54 21 35 …
and sends Sketch(B) to the Referee
Referee: Compute Dup-Ins-Sum(A,B)
Summary
[Diagram: Distinct Sampling at the center, with three extensions]
• Range-Efficiency (range-sampling)
• Update Streams (k-set structure)
• Sliding Windows (multiple samples)
Open Questions
• Can we efficiently handle higher-dimensional ranges?
– Klee’s measure problem in streaming model
• Range-Efficient F0 under update streams
• Duplicate-insensitive Fk (k ≥ 2), range-efficient Fk