Augmented Sketch: Faster and More Accurate Stream Processing
Pratanu Roy (Systems Group, ETH Zurich)
Arijit Khan (Nanyang Technological University, Singapore)
Gustavo Alonso (Systems Group, ETH Zurich)
Data Stream Processing
[Figure: data stream "e a a c e e", with the frequency of item e annotated at two points in the stream (f(e) = 2 and f(e) = 3)]
• Data streams: IP traffic, phone calls, sensor measurements, web clicks and crawls
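The frequency annotation on this slide can be reproduced with a small exact counter. This is only an illustration of the quantity f(e) being tracked, not a stream summary: it keeps one counter per distinct item, which is what sketches are meant to avoid. The function name `running_frequencies` is made up for this example.

```python
from collections import Counter

def running_frequencies(stream, item):
    """Return f(item) after each arrival, using exact (unbounded) counting."""
    counts = Counter()
    history = []
    for x in stream:
        counts[x] += 1
        history.append(counts[item])
    return history

stream = ["e", "a", "a", "c", "e", "e"]
print(running_frequencies(stream, "e"))  # [1, 1, 1, 1, 2, 3]
```

Reading the stream left to right, f(e) is 2 after the fifth item and 3 after the sixth.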
P. Roy, A. Khan, G. Alonso
1/10
• Applications:
-- Load balancing
-- Ranking
-- Frequent itemsets mining
-- Classification
Challenges in Stream Processing
• Trade-off among space, accuracy, and efficiency:
-- Increasing space increases accuracy but reduces throughput
• Other requirements:
-- Build the summary in one pass over the stream
-- Update the summary incrementally
Related Work
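• Sketch
• Space-saving
• Wavelets
• Sampling
[Figure: Count-Min Sketch. An update (e, f) adds f to one counter in each of w rows, at the positions given by hash functions H1(e), ..., Hw(e); each row has h counters.]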
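The Count-Min update shown in the figure can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `CountMinSketch` is hypothetical, and salted BLAKE2 digests stand in for the pairwise-independent hash functions H1, ..., Hw.

```python
import hashlib

class CountMinSketch:
    """w rows of h counters; a point query returns the minimum over the rows."""

    def __init__(self, w, h):
        self.w = w  # number of rows (one hash function per row)
        self.h = h  # number of counters per row
        self.rows = [[0] * h for _ in range(w)]

    def _index(self, row, key):
        # Assumption: a salted digest plays the role of hash function H_row.
        digest = hashlib.blake2b(str(key).encode(), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest, "little") % self.h

    def update(self, key, f=1):
        """Process (key, f): add f to one counter in every row."""
        for row in range(self.w):
            self.rows[row][self._index(row, key)] += f

    def estimate(self, key):
        """Never below the true count; collisions can only inflate it."""
        return min(self.rows[row][self._index(row, key)]
                   for row in range(self.w))

cms = CountMinSketch(w=4, h=64)
for item in "eaacee":
    cms.update(item)
print(cms.estimate("e"))  # at least 3; exactly 3 unless e collides in every row
```

Taking the minimum over rows is what bounds the error: an item is overestimated only if it collides with other items in all w rows.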
Our Motivation
• Improve accuracy for frequent items
-- Critical for threshold checking (service-level agreements), ranking, and load balancing
• Reduce misclassification
-- Frequent-item mining is the first step of frequent-itemset mining
-- Even a small number of misclassified items leads to a large number of false-positive itemsets
• Improve throughput
Main Takeaway
[Figure: frequency vs. items, with regions labeled hot data and cold data]
Let the common case go faster
Main Takeaway: Solution Framework
[Figure: the input first passes through an optimized codepath (filter) that handles hot data; cold data goes on to state-of-the-art sketch algorithms]
• The optimized codepath acts like a filter for hot data
• Improvements: accuracy, throughput
Main Takeaway: Desired Outcome
[Figure: throughput vs. skew, comparing the solution framework with the state of the art]
Augmented Sketch (ASketch)
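[Figure: the input first goes through a filter that holds the items with higher counts; items with lower counts go to the Count-Min sketch. An item k moves from the sketch to the filter when estimate(k) > minimum count in the filter.]
• Filter: very small (~0.3%), which keeps processing very fast (with SIMD)
• Challenges
-- Removing items from the sketch
-- Cascading exchanges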
Exchange Mechanism
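Case 1: item A arrives
[Figure: the filter holds A (new count 7, old count 2) and B (new count 10, old count 1); Count-Min rows are H1 = 4 3 8 7 and H2 = 6 4 6 9]
• Found in filter
• Update in filter: the new count of A goes from 7 to 8
Case 2: item C arrives
[Figure: the filter holds A (new count 8, old count 2) and B (new count 10, old count 1); C's Count-Min counters go from 8 to 9 in row H1 and from 9 to 10 in row H2]
• Not found in filter
• Update in Count-Min
• estimate(C) > minimum count (filter) → initiate an exchange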
Exchange Mechanism
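Step 1: Move C to filter
[Figure: C takes A's slot in the filter with new count 9 and old count 9; A leaves with new count 8 and old count 2; the Count-Min counters are unchanged]
Step 2: Move A to Count-Min with (8-2) = 6
[Figure: the filter holds C (new count 9, old count 9) and B (new count 10, old count 1); A's Count-Min counters grow by 6, from 7 to 13 in row H1 and from 4 to 10 in row H2]
• We do not perform multiple exchanges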
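The update path and the two-step exchange can be sketched as follows. This is a hedged illustration built only from what the slides state: the class names `CountMin` and `ASketch` are hypothetical, a Python dict stands in for the filter (the deck lists array, heap, and Stream Summary variants), salted BLAKE2 digests stand in for the hash functions, and `filter_size` is assumed to be at least 1.

```python
import hashlib

class CountMin:
    """Tiny Count-Min sketch: w rows of h counters each."""

    def __init__(self, w, h):
        self.h = h
        self.rows = [[0] * h for _ in range(w)]

    def _cells(self, key):
        for r, row in enumerate(self.rows):
            d = hashlib.blake2b(str(key).encode(), digest_size=8,
                                salt=r.to_bytes(8, "little")).digest()
            yield row, int.from_bytes(d, "little") % self.h

    def update(self, key, amount):
        for row, i in self._cells(key):
            row[i] += amount

    def estimate(self, key):
        return min(row[i] for row, i in self._cells(key))


class ASketch:
    """Filter in front of a sketch; filter maps key -> [new_count, old_count]."""

    def __init__(self, filter_size, w, h):
        self.filter_size = filter_size
        self.filter = {}
        self.sketch = CountMin(w, h)

    def update(self, key, amount=1):
        slot = self.filter.get(key)
        if slot is not None:                      # found in filter
            slot[0] += amount
            return
        if len(self.filter) < self.filter_size:   # filter still has room
            self.filter[key] = [amount, 0]
            return
        self.sketch.update(key, amount)           # not in filter: update sketch
        est = self.sketch.estimate(key)
        victim = min(self.filter, key=lambda k: self.filter[k][0])
        new_count, old_count = self.filter[victim]
        if est > new_count:                       # initiate a single exchange
            del self.filter[victim]
            self.filter[key] = [est, est]         # step 1: move key to filter
            if new_count > old_count:             # step 2: return only the part
                self.sketch.update(victim, new_count - old_count)  # not yet in sketch

    def estimate(self, key):
        slot = self.filter.get(key)
        return slot[0] if slot is not None else self.sketch.estimate(key)


ask = ASketch(filter_size=2, w=3, h=32)
for item in "aabacaad":
    ask.update(item)
print(ask.estimate("a"))
```

The old count records how much of an item's count is already in the sketch, so an evicted item adds back only (new count - old count). The evicted item is not re-checked against the filter, which matches the single-exchange rule on the slide.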
Other Technical Contributions
• Theoretical error bounds
• Four different filter implementations
-- Array, Strict Heap, Relaxed Heap, Stream Summary
• Hardware-conscious filter (SIMD)
• Pipeline parallelism
• SPMD parallelism
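The slides do not spell out the SPMD design, so the following is only one plausible reading, stated as an assumption: every worker runs the same summarization code on its own shard of the stream, and a point query sums the per-shard answers. An exact `Counter` stands in for each worker's summary, and the function names are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def summarize(shard):
    """The single program every worker runs: summarize one shard."""
    return Counter(shard)  # stand-in for a per-worker sketch

def spmd_summaries(stream, workers):
    """Split the stream round-robin and summarize each shard independently."""
    shards = [stream[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(summarize, shards))

def estimate(summaries, key):
    """Point query: add up the answers of all per-shard summaries."""
    return sum(s[key] for s in summaries)

summaries = spmd_summaries(list("eaacee" * 10), workers=4)
print(estimate(summaries, "e"))  # 30
```

Python threads illustrate the structure only; they do not give the multi-core speedup that a native implementation would aim for.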
Experimental Results: Stream Processing Throughput
[Figure: throughput (million items/sec, axis ticks from 4 to 128) vs. skew (Zipf parameter z, 0.0 to 3.0) for ASketch, FCM, Count-Min, and Holistic UDAFs]
• Count-Min [Cormode et al., Algo 05]
• Frequency Aware Counting (FCM) [Thomas et al., ICDE 09]
• Holistic UDAFs [Cormode et al., SIGMOD 04]
• Synthetic data, 8M data items, stream size: 32M, sketch size = 128 KB, filter size = 0.4 KB
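A skewed synthetic stream of this kind can be generated as below. This is a scaled-down sketch, not the authors' generator: the function names, the seed, and the sizes are assumptions, and item ranks are drawn with probability proportional to 1/rank^z.

```python
import random
from collections import Counter

def zipf_weights(n_items, z):
    """Unnormalized Zipf weights: the item of rank r gets weight 1 / r**z."""
    return [1.0 / (rank ** z) for rank in range(1, n_items + 1)]

def zipf_stream(n_items, length, z, seed=0):
    """Draw `length` item ids (0 is the most popular) with Zipf skew z."""
    rng = random.Random(seed)
    return rng.choices(range(n_items), weights=zipf_weights(n_items, z), k=length)

stream = zipf_stream(n_items=1000, length=10000, z=1.5)
top_item, top_count = Counter(stream).most_common(1)[0]
print(top_item, top_count / len(stream))
```

With z = 0 every item is equally likely; as z grows, a handful of items account for most of the stream, which is the regime where a small filter absorbs most updates.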
Experimental Results: Query Processing Throughput
[Figure: throughput (million items/sec, axis ticks from 4 to 128) vs. skew (Zipf parameter z, 0.0 to 3.0) for ASketch, FCM, Count-Min, and Holistic UDAFs]
• Synthetic data, 8M data items, stream size: 32M, sketch size = 128 KB, filter size = 0.4 KB
• Queries are generated by sampling the input distribution
Experimental Results: Accuracy Improvement
[Figure: observed error (%), axis from 0 to 4, for Count-Min, ASketch, Holistic UDAFs, FCM, and ASketch-FCM]
• IP-Trace, 13M data items, stream size = 461M, sketch size = 128 KB, Zipf 0.9
• Queries are generated by sampling the input distribution
Conclusions
• ASketch dynamically identifies and aggregates the most frequent items
-- Improves the throughput and accuracy of existing sketches
-- Allows efficient use of modern hardware features such as SIMD and multi-core processors
• Future work: investigate the use of ASketch in machine learning and data mining applications
Impact of Misclassifications
[Figure: average relative error (log-scale axis, 1 to 1,000,000) vs. sketch size (16 KB, 24 KB, 32 KB) for Count-Min and ASketch]
• 8M data items; stream size = 32M, filter size = 0.4 KB
Dataflow with Pipeline Parallelism
[Figure: the input enters the optimized codepath on Core 1, which passes items on to Count-Min on Core 2]
• Use message passing to communicate across the cores
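The two-stage dataflow can be sketched with two threads and a queue as the message channel. This is a simplified illustration under stated assumptions: threads stand in for cores, the hot set is fixed up front (no exchanges), an exact `Counter` stands in for Count-Min, and `run_pipeline` is a made-up name.

```python
import queue
import threading
from collections import Counter

def run_pipeline(stream, hot_items):
    filter_counts = {k: 0 for k in hot_items}  # stage 1 state (filter)
    backend = Counter()                         # stage 2 state (sketch stand-in)
    channel = queue.Queue(maxsize=1024)         # messages from stage 1 to stage 2
    done = object()                             # sentinel that ends stage 2

    def stage1():
        for item in stream:
            if item in filter_counts:
                filter_counts[item] += 1        # hot item: handled locally
            else:
                channel.put(item)               # cold item: forward as a message
        channel.put(done)

    def stage2():
        while True:
            item = channel.get()
            if item is done:
                break
            backend[item] += 1

    threads = [threading.Thread(target=stage1), threading.Thread(target=stage2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return filter_counts, backend

hot, cold = run_pipeline(list("eaacee"), {"e"})
print(hot, dict(cold))  # {'e': 3} {'a': 2, 'c': 1}
```

Because each stage owns its own state and communicates only through the queue, neither structure needs a lock; the more skewed the stream, the fewer messages cross between the stages.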
Parallel-ASketch vs ASketch
[Figure: throughput (million items/sec, axis ticks from 4 to 256) vs. skew (Zipf parameter z, 0.0 to 3.0) for Parallel-ASketch and ASketch]
• Synthetic data, 8M data items, stream size: 32M, sketch size = 128 KB, filter size = 0.4 KB