Augmented Sketch: Faster and More Accurate Stream Processing

Pratanu Roy (Systems Group, ETH Zurich)
Arijit Khan (Nanyang Technological University, Singapore)
Gustavo Alonso (Systems Group, ETH Zurich)

Data Stream Processing
[Figure: data stream e a a c e e, annotated with frequency estimates f(e) = 3 ... f(e) = 2]
Data stream: IP traffic, phone calls, sensor measurements, web clicks and crawls
Applications:
• Load balancing
• Ranking
• Frequent itemsets mining
• Classification
P. Roy, A. Khan, G. Alonso 1/10

Challenges in Stream Processing
Trade-off among space, accuracy, and efficiency:
-- Increasing space increases accuracy, but reduces throughput
Other requirements:
-- Build summary in one pass over the stream
-- Incremental updates in summary

Related Work
Sketch, Space-saving, Wavelets, Sampling
[Figure: Count-Min Sketch, drawn as an array with dimensions labelled w and h. An update (e, f) adds +f to the counters at H1(e), ..., Hw(e)]

Our Motivation
Improve accuracy for frequent items
-- Critical for threshold checking (service-level agreement), ranking, load-balancing
Reduce misclassification
-- Frequent items mining is the first step of frequent itemsets mining
-- Even a small number of misclassified items leads to a large number of false-positive itemsets
Improve throughput

Main Takeaway
[Figure: frequency vs. items, with the high-frequency items marked as hot data and the rest as cold data]
Let the common case go faster

Main Takeaway: Solution Framework
[Figure: Input -> Optimized Codepath (Filter) for hot data; cold data -> State-of-the-art Sketch Algorithms]
Optimized codepath acts like a filter for hot data
Improvements: accuracy, throughput

Main Takeaway: Desired Outcome
[Figure: throughput vs. skew, comparing the solution framework with the state of the art]
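
The Count-Min Sketch named on the Related Work slide can be illustrated with a minimal Python sketch: an update (e, f) adds f to one counter per hash row, and a point query returns the minimum of those counters. The class name, the parameter names (num_hashes, row_width), and the BLAKE2b-based hashing are assumptions made for this illustration; they are not taken from the slides.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch: num_hashes rows of row_width counters."""

    def __init__(self, num_hashes=4, row_width=1024):
        self.num_hashes = num_hashes
        self.row_width = row_width
        self.rows = [[0] * row_width for _ in range(num_hashes)]

    def _index(self, row, item):
        # Assumption: a salted BLAKE2b digest stands in for the per-row
        # hash functions H1..Hw shown in the figure.
        digest = hashlib.blake2b(
            repr(item).encode(),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.row_width

    def update(self, item, count=1):
        # Update (e, f): add f to exactly one counter in every row.
        for row in range(self.num_hashes):
            self.rows[row][self._index(row, item)] += count

    def estimate(self, item):
        # Point query: the minimum over the rows. Collisions can only
        # inflate a counter, so the estimate never undercounts.
        return min(
            self.rows[row][self._index(row, item)]
            for row in range(self.num_hashes)
        )


sketch = CountMinSketch()
for item in "eaacee":  # the example stream from the first slide
    sketch.update(item)
```

With the stream above, sketch.estimate("e") is at least 3; it equals 3 unless another item collides with "e" in every row.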

Augmented Sketch (ASketch)
[Figure: Input -> Filter -> Count-Min. Items with lower count move from the filter to Count-Min; items with higher count move from Count-Min to the filter when estimate(k) > minimum in (filter)]
Filter: very small (~0.3%) and makes the processing very fast (with SIMD)
Challenges
-- Removing items from sketch
-- Cascading exchanges

Exchange Mechanism
[Figure: a two-slot filter (each item has a new count and an old count) next to a Count-Min sketch with rows H1 and H2]
Item A arrives: found in filter, update in filter
-- Filter: A (new 7 -> 8, old 2), B (new 10, old 1)
-- Count-Min: H1 = 4 3 8 7, H2 = 6 4 6 9
Item C arrives: not found in filter, update in Count-Min
-- Count-Min: H1 = 4 3 9 7, H2 = 6 4 6 10
-- estimate(C) > minimum count (filter): initiate an exchange
Step 1: Move C to filter
-- Filter: C (new 9, old 9), B (new 10, old 1); A leaves with (new 8, old 2)
Step 2: Move A to Count-Min with (8-2) = 6
-- Count-Min: H1 = 4 3 9 13, H2 = 6 10 6 10
We do not perform multiple exchanges

Other Technical Contributions
Theoretical error bounds
Four different filter implementations
-- Array, Strict Heap, Relaxed Heap, Stream Summary
Hardware-conscious filter (SIMD)
Pipeline parallelism
SPMD parallelism

Experimental Results: Stream Processing Throughput
[Figure: throughput (million items/sec, axis ticks from 4 to 128) vs. skew (Zipf parameter z, 0.0 to 3.0) for ASketch, FCM, Count-Min, and Holistic UDAFs]
Compared methods:
-- Count-Min [Cormode et al., Algo 05]
-- Frequency Aware Counting (FCM) [Thomas et al., ICDE 09]
-- Holistic UDAFs [Cormode et al., SIGMOD 04]
Synthetic data, 8M data items, stream size: 32M, sketch size = 128 KB, filter size = 0.4 KB

Experimental Results: Query Processing Throughput
[Figure: throughput (million items/sec, axis ticks from 4 to 128) vs. skew (Zipf parameter z, 0.0 to 3.0) for ASketch, FCM, Count-Min, and Holistic UDAFs]
Synthetic data, 8M data items, stream size: 32M, sketch size = 128 KB, filter size = 0.4 KB
Queries are generated by sampling the input distribution
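
The filter and the exchange mechanism from the ASketch slides can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and method names are hypothetical, the filter is a plain dictionary rather than one of the four filter structures listed, Python's built-in hash stands in for real sketch hash functions, and the rule that new items enter the filter while it still has free slots is an assumption the slides do not spell out.

```python
class CountMin:
    """Minimal Count-Min sketch; the hashing scheme is illustrative only."""

    def __init__(self, num_hashes=4, row_width=256):
        self.row_width = row_width
        self.rows = [[0] * row_width for _ in range(num_hashes)]

    def _cells(self, item):
        # One counter position per row (built-in hash is a stand-in).
        return [hash((r, item)) % self.row_width for r in range(len(self.rows))]

    def update(self, item, count=1):
        for row, pos in zip(self.rows, self._cells(item)):
            row[pos] += count

    def estimate(self, item):
        return min(row[pos] for row, pos in zip(self.rows, self._cells(item)))


class ASketch:
    """Small filter in front of a sketch, with at most one exchange per update."""

    def __init__(self, filter_size=2, sketch=None):
        self.filter_size = filter_size  # assumed to be at least 1
        self.filter = {}                # item -> [new_count, old_count]
        self.sketch = sketch if sketch is not None else CountMin()

    def update(self, item):
        entry = self.filter.get(item)
        if entry is not None:
            entry[0] += 1               # found in filter: update in filter
            return
        if len(self.filter) < self.filter_size:
            # Assumption: free filter slots are filled directly.
            self.filter[item] = [1, 0]
            return
        self.sketch.update(item)        # not found: update in Count-Min
        est = self.sketch.estimate(item)
        victim = min(self.filter, key=lambda k: self.filter[k][0])
        if est > self.filter[victim][0]:
            # Step 1: move the item to the filter; old_count remembers how
            # much of its count already lives in the sketch.
            new_count, old_count = self.filter.pop(victim)
            self.filter[item] = [est, est]
            # Step 2: move the victim to the sketch with (new - old).
            if new_count > old_count:
                self.sketch.update(victim, new_count - old_count)
            # No further exchange is attempted for the victim.

    def query(self, item):
        if item in self.filter:
            return self.filter[item][0]
        return self.sketch.estimate(item)


asketch = ASketch(filter_size=1)
for item in ["a", "c", "c"]:
    asketch.update(item)
# The second "c" triggers an exchange: filter is now {"c": [2, 2]}
```

Tracking old_count avoids having to remove an item from the sketch when it moves to the filter: on eviction only the difference (new - old) is added back, which mirrors the (8-2) = 6 step in the slide example.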

Experimental Results: Accuracy Improvement
[Figure: observed error (%), axis from 0 to 4, for Count-Min, ASketch, Holistic UDAFs, FCM, and ASketch-FCM]
• IP-Trace, 13M data items, stream size = 461M, sketch size = 128 KB, zipf 0.9
• Queries are generated by sampling the input distribution

Conclusions
ASketch dynamically identifies and aggregates most frequent items
-- Improves throughput and accuracy of existing sketches
-- Allows efficient utilization of modern hardware features such as SIMD, multi-cores, etc.
Future work: investigate the use of ASketch in the context of machine learning and data mining applications

Impact of Misclassifications
[Figure: average relative error (axis ticks from 1 to 1000000) vs. sketch size (16 KB, 24 KB, 32 KB) for Count-Min and ASketch]
8M data items, stream size = 32M, filter size = 0.4 KB

Dataflow with Pipeline Parallelism
[Figure: Input -> Core 1 (optimized codepath) -> Core 2 (Count-Min)]
Use message passing to communicate across the cores

Parallel-ASketch vs ASketch
[Figure: throughput (million items/sec, axis ticks from 4 to 256) vs. skew (Zipf parameter z, 0.0 to 3.0) for Parallel-ASketch and ASketch]
Synthetic data, 8M data items, stream size: 32M, sketch size = 128 KB, filter size = 0.4 KB
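
The dataflow on the Pipeline Parallelism slide (a filter stage on one core passing messages to a sketch stage on another) can be illustrated with two Python threads and a queue. This is a heavily simplified sketch: all names are hypothetical, the set of hot items is fixed up front so no exchanges occur, a Counter stands in for the Count-Min sketch, and CPython threads show the message-passing structure rather than the speedup reported for Parallel-ASketch.

```python
import queue
import threading
from collections import Counter

_DONE = object()  # sentinel message marking the end of the stream


def filter_stage(stream, hot_counts, outbox):
    """Stage 1 ("Core 1"): count hot items locally, forward the rest."""
    for item in stream:
        if item in hot_counts:
            hot_counts[item] += 1
        else:
            outbox.put(item)  # message to the sketch stage
    outbox.put(_DONE)


def sketch_stage(inbox, cold_counts):
    """Stage 2 ("Core 2"): apply forwarded updates to the sketch stand-in."""
    while True:
        item = inbox.get()
        if item is _DONE:
            break
        cold_counts[item] += 1


def run_pipeline(stream, hot_items):
    hot_counts = {item: 0 for item in hot_items}
    cold_counts = Counter()  # stand-in for a Count-Min sketch
    channel = queue.Queue(maxsize=1024)
    consumer = threading.Thread(target=sketch_stage, args=(channel, cold_counts))
    consumer.start()
    filter_stage(stream, hot_counts, channel)
    consumer.join()
    return hot_counts, cold_counts


hot, cold = run_pipeline("eaacee", {"e"})
# hot == {"e": 3}; cold counts "a" twice and "c" once
```

The bounded queue applies back-pressure: if the sketch stage falls behind, the filter stage blocks on put instead of buffering the whole stream.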