Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA) Outline 2 Background and Preliminaries Probabilistic Threshold Range Aggregate Query Exact query processing Approximate query processing: Simple Sampling & Double Sampling Experiments Conclusion DB@UNSW Applications 3 Many applications involve data that is imperfect due to data randomness and incompleteness limitation of equipment delay or lose in data transfer …… Applications Sensor networks Environmental surveillance Moving objects Data cleaning and integration …… DB@UNSW Applications 4 Sensor Networks: Sensor readings are often imprecise due to equipment limitation and periodical reporting mechanism. (figures are borrowed from Jian et al, SIGMOD08) DB@UNSW Applications 5 Mobile Equipments / Moving Objects A mobile object reports its location periodically, the exact location is often uncertain. DB@UNSW Applications 6 Satellite data DB@UNSW Applications 7 Data Quality Social Data Collection: Errors and estimation inherent in customer surveys and sampling DBG @ UNSW Outline 8 Background and Preliminaries Modeling Uncertainty & Related Work Probabilistic Threshold Range Query Conclusion DB@UNSW Modeling Uncertainty ( cont. ) 9 1. Uncertain Objects Model Continuous case: described using a probability density function (PDF) fU such that u U f U (u )du 1. E.g., uniform distribution, normal distribution. DB@UNSW Modeling Uncertainty ( cont. ) 10 2. Uncertain Objects Model Discrete case : described using a set of instances each instance u has an occurrence probability pu u U pu 1 DB@UNSW Possible World Semantics 11 Given a set of uncertain objects {U1,U2, ..., Un}, a possible world W = {u1,u2, .., un} is a set of n instances --- one instance per uncertain object The probability of a possible worlds is n P(W) = i 1 P(ui ) Let Ω be the set of all possible world, clearly, P ( W ) 1 W DB@UNSW Probabilistic Queries: 12 Query Evaluation [CKP03, CXPSV04, DS04, DS05, DS07, SD07] Aggregate Queries [BDJR05, MJ07, CG07] Join Queries [CSP06, AW07] Top-k queries [SIC07, YLSK08, RDS07, HJZL08] Nearest Neighbor Queries [KKR07, CCMC08] Skyline Queries [PJLY07] …… DB@UNSW Range query 13 Uncertain objects, exact query Probability threshold is often assigned DBG @ UNSW Related Work 14 Range Queries [TCXNKP05, BPS06, AY08] Given a rectangle r and a probabilistic threshold t , find all objects that appear in r with probability at least t. Appearance probability xo. regionr r o. pdf ( x)dx o.region DB@UNSW U-tree 15 Probabilistically Constrained Region ( PCR ) PCR (0.2) Multi PCRs DB@UNSW [TCXNKP05] Outline 16 Introduction Modeling Uncertainty & Related Work Probabilistic Threshold Range Aggregate Query (PTRA) Conclusion DB@UNSW Contribution 17 Formally define PTRA query aU-Tree structure for exact PTRA query singleSample and doubleSample techniques for approximate answer. DB@UNSW Problem Statement 18 Given a set of uncertain objects and query q , return the number of uncertain objects with appearance probability no less than threshold pq DB@UNSW Problem Definition 19 Assume threshold = 0.5, if the appearance probability computed for b is > 0.5 and for c is < 0.5, then the aggregate returned is 2 (a & b) DB@UNSW Exact Query Processing ( aU-Tree) 20 Main idea: add aggregate information on U-tree Advantage: stop at intermediate level if pruned or fully covered by the query Disadvantage: otherwise, still need to drill down to the leaf nodes. For a large portion of uncertain objects, appearance probability needs to be computed Expensive for a massive number of instances per object! DB@UNSW Exact Query Processing ( aU-Tree) 21 DB@UNSW singleSample 22 Sampling the instances of the uncertain objects. If m’ out of m sampled instances are inside query region, then the approximate appearance probability is m’/m DB@UNSW singleSample ( cont. ) 23 An immediate application of Chernoff-Hoeffding bound DB@UNSW doubleSample 24 Single Sampling is expensive when there is a massive number of objects! Sampling the uncertain objects as well. Naive : uniform sampling objects from all uncertain objects. DB@UNSW doubleSample: Accuracy 25 •Note: “ appearance probability” of each object follows uniform distribution means spatial location is uniformly distributed. •Using Chernoff-Hoeffding bound. DB@UNSW doubleSample: Our Approach 26 Skew! Aim: select K disjoint groups covering all objects with the minimum “skew”; i.e. objects in each group with “uniform” distribution. (Then do uniform sampling of objects in each group.) The optimization problem is NP-hard. Observation: Min-skew is a good heuristic to conduct such a group. aU-tree groups objects with a similar principle to the minskew. DB@UNSW doubleSample: Our Approach 27 Step 1: choose K subtrees to cover all objects with the total minimum skew. NP-hard! Find a level L such that the number of nodes at level L is smaller than K but the number of nodes at level L-1 is larger than K. Feed the min-skew algorithm with the subtrees at level L. (note: if at a level L, the number of nodes = K, then these K subtrees are chosen.) Step 2: sample objects in each subtree. Step 3. sample instances in each sampled object. DB@UNSW Experiments 28 Algorithms: exact, singleSample, doubleSample Data set: LB : 53k objects at long beach country CA : 62k objects at California Synthetic aircraft dataset in 3D 10k instances for each points follow Uniform or constrained-Gaussian Setting : C++, P4 2.8GHz , 2G memory, Debian linux, Page size DB@UNSW 8K Efficiency 29 DB@UNSW Accuracy 30 DB@UNSW Accuracy ( cont. ) 31 DB@UNSW Conclusion 32 Definition of PTRA aU-Tree technique Sampling technique Future work. Any approach with theoretic guarantee? DB@UNSW 33 Thanks DB@UNSW Min-Skew technique 34 DB@UNSW