Download apweb09

Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA) Outline 2   Background and Preliminaries Probabilistic Threshold Range Aggregate Query Exact query processing  Approximate query processing: Simple Sampling & Double Sampling    Experiments Conclusion DB@UNSW Applications 3  Many applications involve data that is imperfect due to      data randomness and incompleteness limitation of equipment delay or lose in data transfer …… Applications      Sensor networks Environmental surveillance Moving objects Data cleaning and integration …… DB@UNSW Applications 4  Sensor Networks:  Sensor readings are often imprecise due to equipment limitation and periodical reporting mechanism. (figures are borrowed from Jian et al, SIGMOD08) DB@UNSW Applications 5  Mobile Equipments / Moving Objects  A mobile object reports its location periodically, the exact location is often uncertain. DB@UNSW Applications 6  Satellite data DB@UNSW Applications 7  Data Quality  Social Data Collection: Errors and estimation inherent in customer surveys and sampling DBG @ UNSW Outline 8  Background and Preliminaries  Modeling   Uncertainty & Related Work Probabilistic Threshold Range Query Conclusion DB@UNSW Modeling Uncertainty ( cont. ) 9  1. Uncertain Objects Model Continuous case: described using a probability density function (PDF) fU such that u  U f U (u )du  1. E.g., uniform distribution, normal distribution. DB@UNSW Modeling Uncertainty ( cont. ) 10  2. Uncertain Objects Model Discrete case : described using a set of instances each instance u has an occurrence probability pu  u U pu  1 DB@UNSW Possible World Semantics 11  Given a set of uncertain objects {U1,U2, ..., Un}, a possible world W = {u1,u2, .., un} is a set of n instances --- one instance per uncertain object  The probability of a possible worlds is  n P(W) =  i 1 P(ui ) Let Ω be the set of all possible world, clearly,  P ( W )  1 W  DB@UNSW Probabilistic Queries: 12  Query Evaluation [CKP03, CXPSV04, DS04, DS05, DS07, SD07]  Aggregate Queries [BDJR05, MJ07, CG07]      Join Queries [CSP06, AW07] Top-k queries [SIC07, YLSK08, RDS07, HJZL08] Nearest Neighbor Queries [KKR07, CCMC08] Skyline Queries [PJLY07] …… DB@UNSW Range query 13   Uncertain objects, exact query Probability threshold is often assigned DBG @ UNSW Related Work 14 Range Queries [TCXNKP05, BPS06, AY08]  Given a rectangle r and a probabilistic threshold t , find all objects that appear in r with probability at least t. Appearance probability  xo. regionr r o. pdf ( x)dx o.region DB@UNSW U-tree 15 Probabilistically Constrained Region ( PCR ) PCR (0.2) Multi PCRs DB@UNSW [TCXNKP05] Outline 16     Introduction Modeling Uncertainty & Related Work Probabilistic Threshold Range Aggregate Query (PTRA) Conclusion DB@UNSW Contribution 17    Formally define PTRA query aU-Tree structure for exact PTRA query singleSample and doubleSample techniques for approximate answer. DB@UNSW Problem Statement 18 Given a set of uncertain objects and query q , return the number of uncertain objects with appearance probability no less than threshold pq DB@UNSW Problem Definition 19 Assume threshold = 0.5, if the appearance probability computed for b is > 0.5 and for c is < 0.5, then the aggregate returned is 2 (a & b) DB@UNSW Exact Query Processing ( aU-Tree) 20  Main idea: add aggregate information on U-tree  Advantage: stop at intermediate level if pruned or fully covered by the query  Disadvantage: otherwise, still need to drill down to the leaf nodes.  For a large portion of uncertain objects, appearance probability needs to be computed  Expensive for a massive number of instances per object! DB@UNSW Exact Query Processing ( aU-Tree) 21 DB@UNSW singleSample 22   Sampling the instances of the uncertain objects. If m’ out of m sampled instances are inside query region, then the approximate appearance probability is m’/m DB@UNSW singleSample ( cont. ) 23 An immediate application of Chernoff-Hoeffding bound DB@UNSW doubleSample 24  Single Sampling is expensive when there is a massive number of objects! Sampling the uncertain objects as well. Naive : uniform sampling objects from all uncertain objects.  DB@UNSW doubleSample: Accuracy 25 •Note: “ appearance probability” of each object follows uniform distribution means spatial location is uniformly distributed. •Using Chernoff-Hoeffding bound. DB@UNSW doubleSample: Our Approach 26     Skew! Aim: select K disjoint groups covering all objects with the minimum “skew”; i.e. objects in each group with “uniform” distribution. (Then do uniform sampling of objects in each group.) The optimization problem is NP-hard. Observation:  Min-skew is a good heuristic to conduct such a group.  aU-tree groups objects with a similar principle to the minskew. DB@UNSW doubleSample: Our Approach 27  Step 1: choose K subtrees to cover all objects with the total minimum skew. NP-hard!     Find a level L such that the number of nodes at level L is smaller than K but the number of nodes at level L-1 is larger than K. Feed the min-skew algorithm with the subtrees at level L. (note: if at a level L, the number of nodes = K, then these K subtrees are chosen.) Step 2: sample objects in each subtree. Step 3. sample instances in each sampled object. DB@UNSW Experiments 28 Algorithms: exact, singleSample, doubleSample Data set: LB : 53k objects at long beach country CA : 62k objects at California Synthetic aircraft dataset in 3D 10k instances for each points follow Uniform or constrained-Gaussian Setting : C++, P4 2.8GHz , 2G memory, Debian linux, Page size DB@UNSW 8K Efficiency 29 DB@UNSW Accuracy 30 DB@UNSW Accuracy ( cont. ) 31 DB@UNSW Conclusion 32     Definition of PTRA aU-Tree technique Sampling technique Future work. Any approach with theoretic guarantee? DB@UNSW 33 Thanks DB@UNSW Min-Skew technique 34 DB@UNSW

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download apweb09