Download ppt slides

A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented by Sushanth Sivaram Vallath Why Approximate query Answering?     Most applications are OLAP and data mining for analyzing large databases. Most of the queries are aggregation queries on these large databases. Hence expensive and resource intensive. Approximate answers given accurately and efficiently benefits the scalability of the application. Approaches to Approximation/Data Reduction   Use pre-computed samples of data instead of complete data Sampling:     Weighted Sampling Congressional sampling On the fly sampling (error prone when selections, GROUP BYs and joins are used) Workload (Deterministic solution for identical workloads) Approaches to Approximation/Data Reduction (contd.)    “similar” workloads, are considered as optimization problem (minimizing the error) Histograms Wavelets Attacking “similar” workloads   Stratified sampling Minimize error in estimation of aggregates Drawbacks of previous studies    Lack of rigorous problem formulations leds to solutions that are difficult to evaluate theoretically. Does not deal with uncertainty in incoming queries that are “similar” but identical Ignore the variance in data distribution of aggregated columns Architecture for AQP  Queries with selections, foreignkey joins and GROUP BY, containing aggregation functions such as COUNT, SUM, AVG. Architecture for AQP Architecture for AQP   Offline Component: building of sample Online Component: 1. 2. Rewrites an incoming query to use the sample to answer the query approximately Reports the answer with error estimates Pre-computing Samples for Fixed Workload  Fundamental Regions: For a given relation R and workload W, consider partitioning the records in R into a minimum number of regions R1, R2, …, Rr such that for any region Rj, each query in W selects either all records in Rj or none. Fixed Workload 1. Identify all fundamental regions → r After step 1, Case A (r ≤ k) and Case B (r >k) (k → sample size) 2. 3. 2. 3. Case A (r ≤ k): (Pick Sample records) Selects the samples by picking exactly one record from each important fundamental region Assigns appropriate values to additional columns in the sample records Case B (r > k): (Pick Sample records) select k regions and then pick one record from each of the selected regions. The heuristic is to select top k. Assign Values to Additional Columns. This is an optimization problem, which is solved by partially differentiating and the resulting linear equations using Gauss-Seidel method. Disadvantage   Per-query (probabilistic) error guarantee is not possible. If incoming query is not identical to a query in the given workload, FIXED can result in unpredictable errors. Lifting Workload to Query Distributions    Should be resilient to situations where “similar” but not identical queries “Similar”ity is not based on syntax. If the records returned by the two queries have significant overlap, it is similar. Each record will have a probability associated with it such that the incoming query will select this record Rationale for Stratified Sampling Minimize the MSE of the lifted workload  An effective scheme for stratification should be such that the expected variance, over all queries in each stratum, is small, and allocate more samples to strata with larger expected variances. Example ProductID Revenue 1 10 2 10 3 10 4 1000 Solution for single-table selection queries with Aggregation 1. Stratification 1. 2. 2. Allocation 1. 3. How many strata required during partition? How many records should each strata have? Determine the number of samples required across each strata Sampling Pragmatic Issues     Identifying Fundamental Regions Handling Large Number of Fundamental Regions Obtaining Integer Solutions Obtaining an Unbiased Estimator Extensions for more General Queries    GROUP BY Queries JOIN Queries Other Extensions  Mix of COUNT and SUM queries Conclusion  The solutions FIXED and STRAT handles the problems of data variance, heterogeneous mixes of queries, GROUP BY and foreign-key joins.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download ppt slides