Download Dense-Region Based Compact Data Cube

Dense-Region Based Compact Data Cube Presented by: Kan Kin Fai Outline Background Introduction to Compact Data Cube Pros and cons of the Compact Data Cube method Dense-Region Based Compact Data Cube Background Why is a data cube? – Some pre-computed aggregates on the underlying data warehouse. System constraints on materializing data cube(s) – Disk space, maintenance cost, etc. Common approach: materialize parts of a data cube. Alternative: use approximation technique – Reason: OLAP applications accept approximate answers in many scenarios. Introduction to Compact Data Cube Compact Data Cube was proposed by Vitter and Wang in Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets (SIGMOD 99). Main Ideas – Offline phase: perform Haar wavelet transform on the underlying data (i.e. the base cuboid) and store the k most significant coefficients. – Online phase: process any given query based on the k most significant coefficients. Introduction to Compact Data Cube Basics of Haar wavelet transform Building Compact Data Cube Thresholding and Ranking Answering On-Line Queries Introduction to Compact Data Cube Basics of Haar wavelet transform – e.g. S = [2, 2, 0, 2, 3, 5, 4, 4] Resolution 8 4 2 1 Averages [2, 2, 0, 2, 3, 5, 4, 4] [2, 1, 4, 4] [1.5, 4] [2.75] Detail Coefficients [0, -1, -1, 0] [0.5, 0] [-1.25] Introduction to Compact Data Cube Basics of Haar wavelet transform – For compression reasons, the detail coefficients are normalized. – The coefficients at the lower resolutions are weighted more heavily. – Approximates the original signal by keeping only the most significant coefficients. – Requires only O(N) CPU time and O(N/B) I/Os to compute for a signal of N values. – Multidimensional wavelet transform: a series of onedimensional wavelet transforms. Introduction to Compact Data Cube Building the Compact Data Cube – Problem 1: the size of the multidimensional array representing the underlying data is too large (assume the data are very sparse). – Solution: Divide the wavelet transform process into multiple passes. Introduction to Compact Data Cube Building the Compact Data Cube – Problem 2: The density of the intermediate results would increase from pass to pass. – Solution: truncate the intermediate multidimensional array by cutting off entries with small magnitude. – I/O complexity: Nz N O( log M / B ) B B Introduction to Compact Data Cube Thresholding and Ranking – Choice 1: keep the C largest (in absolute value) wavelet coefficients. – Choice 2: keep the C wavelet coefficients with the largest weights among the C’ largest coefficients (C < C’). • The weight of a coefficient equals to the number of its dimensions with value zero. Introduction to Compact Data Cube Answering On-Line Queries – Space: ((d+1)k), CPU time:  2d ' min k ,2 1 i d' log Di  Pros and cons of the compact data cube method Pros: – Requires little disk spaces (a small number of disk blocks). – Responds to on-line query fast. – Answers OLAP queries more accurately than other approximation techniques like histogram and random sampling. – Can progressively refine the approximate answer with no added overhead. Pros and cons of the compact data cube method Cons: – Approximates a vast amount of useless empty cells in base cuboid together with useful non-empty cells in base cuboid. – Needs to cut off entries with small magnitude at the end of each pass in order to maintain a constant amount of I/O operations from pass to pass. Dense-Region Based Compact Data Cube Aim: – To enhance the ability of the compact data cube method to handle datasets having dense-regions-in-sparse-cube property. Main Idea: – To exclude empty cells in base cuboid from approximation. Two-phase approach – Compute dense regions in base cuboid. – Approximate each dense region independently. Dense-Region Based Compact Data Cube Question 1: how can we find the dense regions efficiently? – Efficient DEnse region Mining (EDEM) algorithm proposed by Cheung et al. in DROLAP -- A DenseRegion-Based Approach to On-line Analytical Processing (DEXA99) Dense-Region Based Compact Data Cube – Basic ideas of EDEM: • Build a k-d tree to store the valid cells. • Grow dense region covers along boundaries. • Search dense regions among the covers. – Complexity of EDEM: linear to the number of dimensions and sub-quadratic to the number of data points. Dense-Region Based Compact Data Cube Question 2: how should we allocate disk space in approximating the dense regions? – Choice 1: allocate disk space equally to each dense regions. – Choice 2: allocate disk space according to the sizes of dense regions. – Choice 3: order the wavelet coefficients of all the dense regions and keep the most significant ones (in absolute value). Dense-Region Based Compact Data Cube Question 3: how should we treat the data points outside the dense regions? – Keep all or keep only significant ones. Question 4: how do we answer on-line queries using the dense-region based approach? – Check if a dense region covered by the given query. – Check if the stored coefficients contribute to the range sum and compute the amount of contribution if needed. Dense-Region Based Compact Data Cube One favorable side effect: – we may parallelize the construction of compact data cube. More questions: – How can we handle updates to the underlying data? – How can we approximate iceberg cube? Can we apply the idea of compact data cube to iceberg cube? – Can compact data cube be used to answer other types of OLAP queries besides range-sum?

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Dense-Region Based Compact Data Cube