Download document 8069712

Efficient Processing of Massive Data Streams for Mining and Monitoring Mirek Riedewald Department of Computer Science Cornell University Acknowledgements Al Demers  Abhinandan Das  Alin Dobra  Sasha Evfimievski  Johannes Gehrke  KD-D initiative (Art Becker et al.)  Introduction  Data streams versus databases    Network monitoring   High arrival rates, approximation [CGJSS02] Stock trading   Infinite stream, continuous queries Limited resources Complex computation [ZS02] Retail, E-business, Intelligence, Medical Surveillance   Identify relevant information on-the-fly, archive for data mining Exact results, error guarantees Information Spheres  Local Information Sphere  Within each organization  Continuous processing of distributed data streams  Online evaluation of thousands of triggers  Storage/archival of important data  Global Information Sphere  Between organizations  Share data in privacy preserving way Local Information Sphere Distributed data stream event processing and online data mining  Technical challenges  Blocking operators, unbounded state  Graceful degradation under increasing load  Integration with archive  Processing of physically distributed streams Event Matching, Correlation  Join of data streams Brand Mpix Canon 3.0 Price Mpix Price 200 >2.0 <250 Event Matching, Correlation  Join of data streams Price Mpix Price Canon 3.0 200 >2.0 <250 Fuji 100 >4.0 <400 Brand Mpix 3.0 Event Matching, Correlation  Join of data streams Price Mpix Price Canon 3.0 180 > 2.0 < 250 Fuji 3.0 220 > 4.0 < 400 Kodak 4.0 340 = 3.0 < 200 Brand Mpix Equi-join, text similarity, geographical proximity,…  Problem: unbounded state, computation  Window Joins  Restrict join to window of most recent records (tuples)  Landmark window  Sliding window based on time or number of records  Problem definition  Window based on time: size w  Synchronous record arrival  Equi-join Abstract Model Data streams R(A,…), S(A,…)  Compute equi-join on A   Match all r and s of streams R, S such that r.A=s.A  Sliding window of size w R 1 1 1 S 2 3 1 (r0,s2), (r1,s2), (r2,s2) Abstract Model (cont.) Data streams R(A,…), S(A,…)  Compute equi-join on A   Match all r and s of streams R, S such that r.A=s.A  Sliding window of size w R 1 1 1 3 S 2 3 1 1 (r0,s2), (r1,s2), (r2,s2) (r3,s1), (r1,s3), (r2,s3) Abstract Model (cont.) Data streams R(A,…), S(A,…)  Compute equi-join on A   Match all r and s of streams R, S such that r.A=s.A  Sliding window of size w R 1 1 1 3 2 S 2 3 1 1 4 (r0,s2), (r1,s2), (r2,s2) (r3,s1), (r1,s3), (r2,s3) No new output Limited Resources Focus on limited memory M<2w  State of the art: random load shedding [KNV03]   Random sample of streams  Desired approach: semantic load shedding  Goal: graceful degradation  Approximation  Set-valued result: Error measure? Set-Approximation Error  What is a good error measure?  Information Retrieval, Statistics, Data Mining         Matching coefficient Dice coefficient Jaccard coefficient Cosine coefficient Overlap coefficient | A B | 2 | A  B | /(| A |  | B |) | A B | / | A B | | A B | / | A|  | B | | A B | / min{| A |, | B |} Earth Mover’s Distance (EMD) [RTG98] Match And Compare (MAC) [IP99] Join: subset of output result   EMD, Overlap coefficient trivially 0 or 1 Others (except MAC) reduce to MAX-subset error measure Optimization Problem Select records to be kept in memory such that the result size is maximized subject to memory constraints Lightweight online technique  Adaptivity in presence of memory fluctuations  Optimal Offline Algorithm  What is the best possible that can be achieved?  Optimal sampling strategy for MAX-subset  Bottom-line for evaluation of any online algorithm  Same optimization problem, but knows future  Finite subsets of input streams  Formulate as linear flow problem Generation of Flow Model M=2, w=3 -1 R=1,1,1,3 -1 -1 -1 Fixed memory allocation S=2,3,1,1 cost Capacity: 0..1, linear cost -1 3 -3 -1 Keep in memory Replace Correspondence to Windows R=1,1,1,3 S=2,3,1,1 Correspondence to Windows R=1,1,1,3 S=2,3,1,1 Correspondence to Windows R=1,1,1,3 -1 -1 -1 S=2,3,1,1 Correspondence to Windows R=1,1,1,3 -1 -1 -1 -1 -1 S=2,3,1,1 -1 Complexity Integer solution exists  Optimal solution found in O(n2 m log n)  N input size of single stream  #nodes: n < 2wN + N + 2  #arcs: m < 2n + M + 1  Reasonable costs for benchmarking  Approx. 1GB memory (w=800, M=800)  Approx. 1h computation time Optimal Flow M=2, w=3 -1 R=1,1,1,3 -1 -1 -1 Fixed memory allocation S=2,3,1,1 cost Capacity: 0..1, linear cost -1 3 -3 -1 Keep in memory Replace Easy to Extend M=2, w=3 -1 R=1,1,1,3 -1 -1 -1 Variable memory allocation S=2,3,1,1 cost Capacity: 0..1, linear cost -1 3 -3 -1 Keep in memory Replace Online Heuristics  Maximize expected output  PROB: sort tuples by join partner arrival probability  LIFE: sort tuples by product of partner arrival probability and remaining lifetime  Maintain stream statistics  Histograms (DGIM02, TGIK02), wavelets (GKMS01), quantiles (GKMS02, GK01) Approximation Quality Effect of Skew Summary    Information sphere architecture Optimal algorithm and fast efficient heuristic for sliding window joins Open problems      Other set error measures, resource models Other joins: compress records Complex queries Distributed processing Integration with other techniques into local information sphere Related Work Aurora (Brown, MIT), STREAM (Stanford), Telegraph (Berkeley), NiagaraCQ (Wisconsin, OGI)  Memory requirements [ABBMW02,TM02]  Aggregation   Alon, Bar-Yossef, Datar, Dobra, Garofalakis, Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis, Koudas, Matias, Motwani, Muthukrishnan, Rastogi, Srivastava, Strauss, Szegedy Other Results [DGR03]  Integration with archive    Load smoothing, not shedding Novel “error” measure: archive access cost Static join for sensor networks     Maximize result size subject to constraints on energy consumption Polynomial dynamic programming solution Fast 2-approximation algorithms NP-hardness proof for join of 3 or more streams Other Results (cont.) [DGGR02]  Computation of aggregates over streams for multiple joins  Small pseudo-random sketch synopses (randomized linear projections)  Explicit, tunable error guarantees  Sketch partitioning to boost accuracy (intelligently partition join attribute space) Thanks! ? ? ? Questions? ? ? ? ?

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download document 8069712