Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Crossroads: A Practical Data Sketching Solution for Mining Intersection of Streams Jun Xu, Zhenglin Yu (Georgia Tech) Jia Wang, Zihui Ge, He Yan (AT&T Labs Research) Ashwin Lall (Denison University) Overview •Problem statement •Motivation •Our approach •Data collection architecture •Evaluation •Conclusions Problem Statement • Anonymously mine the logs of cellular data traffic to rapidly detect network performance anomalies • Can be converted to an association-rule mining problem if we store every cellular packet into a static database Motivation • Complexity of the network ecosystem (sophisticated phones, tablets; application server; a tremendous variety of apps and online services) • The performance issues can be introduced by different causes or combinations (device, app, network, app server) • Infeasible to store all combinations of such data (especially when it is collected in real-time) Problem Statement (detailed) An example event “Device = Magic Phone 7” & “OS = Magic OS 88.8” & “Application = FunContent.app” & “Source = Metropolis downtown location” & “Destination = FunContent.com” & “Time = τ” ⇒ “unusually long RTT”. Main challenge Cannot afford to store all the combinations (since the number of different attribute combinations is huge) Our Goal Asymptotic reduction in space usage while keeping accuracy loss small when detecting anomalous values Contribution of Paper An intersection scheme that can significantly reduce the storage cost while keep accuracy loss small. Our approach ● Based on data sketching solution (inspired by the tugof-war sketch) ● A sketch is constructed to succinctly summarize the performance metrics (e.g., average RTT) of all data items Our approach •Partition attributes to 2 groups •Example: group attributes to 2 groups: Ai , Bj Our approach •Ai :the set of packets that match (“Source = the Metropolis downtown location” & “Destination = FunContent.com” & “Time = τ”) •Bj:the set of packets that match “Device = Magic Phone 7” & “OS = Magic OS 88.8” & “Application = FunContent.app”) Our approach •We can compute functions on the intersection of arbitrary sets Ai and Bj Our approach ● Use sketches to store summary statistic (e.g., mean, variance) for Ai and Bj ● Derive the performance metrics of the data by intersecting the sketches ( Ai ∩ Bj ) Storage saving • Reduce the storage cost from O(n) to O(√n) • For example, number is in trillions (∼1012) for joint value combinations of all these attributes • Each subset may only be in millions (∼106) 3-Way Intersection ● How about 3-way intersection? ● An impossibility result: Data Collection Architecture Real data collection* * Note: No personally identifiable information (PII) was gathered or used in conducting this study. To the extent any data was analyzed, it was anonymous and/or aggregated data. Evaluation 8 different attributes in our real data*. Partition scheme: ● 1st group: RNC, service category, handheld device speed category, day of week ● 2nd group: handheld device manufacturer/model, content provider, access point network, hour * Note: No personally identifiable information (PII) was gathered or used in conducting this study. To the extent any data was analyzed, it was anonymous and/or aggregated data. Evaluation • 1.4 million distinct combinations for the 1st group • 1.5 million distinct combinations for the 2st group • Storage cost of maintaining the value of every combination: 1.4M × 1.5M × 4 bytes= 7.5 TB Evaluation • Using our intersection scheme, and using 4096 counters in each sketch, the space cost is (1.4M + 1.5M ) × 4096 × 4 bytes = 45 GB • Much less than 7.5 TB • Relative error will be about 10 % Evaluation ● As buckets number(memory usage) increases, average relative error will decrease. ● As intersection ratio increase, average relative error will decrease. Evaluation Results of mean relative errors when varying memory(buckets number) respectively for situations that intersection ratio = 0.01, 0.02, 0.05, 0.10 Conclusions • We provide an intersection scheme for estimating arbitrary summary statistics on large data sets • We show how to reduce storage cost from O(n) to O(√n) • We demonstrate efficacy using both synthetic and real data