OLAP Technology
Studying the Cube By Operator
Morfonios Constantinos
July 2002

Introduction
■ Database systems have been very popular over the last decade
■ Powerful hardware and software are available at low prices
■ Boring every-day tasks have been automated
■ No more paper => forms, spreadsheets, user-friendly interfaces

Motivation
■ Much information is now digitized
■ Data is used for every-day tasks: transactions, payments, receipts
■ But what can be "hidden" behind all this information?
■ Can data be useful in other ways?
■ Answer: OLAP, Data Mining and Decision Support!

What is OLAP?
■ OLAP stands for On-Line Analytical Processing
■ It stands in contrast to OLTP (On-Line Transaction Processing)
■ It means that we no longer care about individual transactions
■ What we are looking for is trends, statistics and interesting rules hidden behind our data that can help us make business decisions

And why are RDBMSs not enough?
Since we have invested in…
■ Hardware
■ Software
■ Know-how
■ Employees
And since this is a thoroughly studied and tested area

Because of different needs…
■ Historical vs. current data
■ Subject-oriented vs. application-oriented
■ Complex queries "touching" millions of tuples vs. small select-project-join (S-P-J) queries
■ Several scans vs. quick index lookups on primary keys
■ GB vs. MB
■ Query throughput vs. transaction throughput

Problems that arise
■ Modeling
■ Optimization
■ Indexing
■ Concurrency
■ Recovery
■ Administration

New ideas
■ Incremental updating (how/when?)
■ Parallelism
■ Use of multiple sources (even from the Web)
■ Cleaning, transformation and integration of data
■ Materialized views
■ Visualization

Data Warehouses
■ The new database technology that supports OLAP processing is the Data Warehouse
■ It supports data mining and decision-support applications
■ A new conceptual model fits better => the Multidimensional Model

Multidimensional Model
■ We are interested in numeric values, called "measures" (revenue, number of sold items, cost, …)
■ Measures are defined uniquely by "dimensions" that provide their context (time, store, customer, supplier, product, …). Dimensions have their own attributes and may form hierarchies.

Example
Quantity of product p, bought on date d by customer c. Quantity is a point in the 3-d space.
(Figure: a 3-d cube with axes Time, Product and Customer; the quantity sits at the point (d, p, c).)

Why this model?
■ Decision makers and business executives are used to working with spreadsheets.
■ Common operations:
◆ Pivot
◆ Roll-up
◆ Drill-down
◆ Slice-and-Dice
◆ Ranking

And so comes the CUBE!
■ Most operations need multiple aggregations
■ Decision support may need information about revenue per individual customer, per store, per city, per week, per season, per week and season, … (see the sketch below)
■ All possible combinations of dimensions, and not only those, since they can also be computed at different levels of detail
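A small illustrative sketch (not from the original slides), assuming a hypothetical, denormalized fact table sales(customer, store, week, season, revenue): each combination of dimensions the analyst asks for corresponds to a separate GROUP BY query, which is exactly the pattern the CUBE operator generalizes.

    -- Hypothetical fact table: sales(customer, store, week, season, revenue)

    -- revenue per customer
    SELECT customer, SUM(revenue) FROM sales GROUP BY customer;

    -- revenue per store and week
    SELECT store, week, SUM(revenue) FROM sales GROUP BY store, week;

    -- revenue per week and season
    SELECT week, season, SUM(revenue) FROM sales GROUP BY week, season;

    -- total revenue (the empty grouping)
    SELECT SUM(revenue) FROM sales;

With these four dimensions there are already 2^4 = 16 such queries, one per combination of grouping attributes.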
What is the CUBE?
■ It is the 2^N extension of GROUP BY
■ It computes all possible group-bys

Example

    SELECT A, B, C, SUM(D)
    FROM X
    CUBE BY A, B, C

is equivalent to

    SELECT A, B, C, SUM(D) FROM X GROUP BY A, B, C
    UNION
    SELECT A, B, SUM(D) FROM X GROUP BY A, B
    UNION
    SELECT A, C, SUM(D) FROM X GROUP BY A, C
    UNION
    SELECT B, C, SUM(D) FROM X GROUP BY B, C
    UNION
    …
    SELECT SUM(D) FROM X

Naïve Solution
■ Compute all possible group-bys separately and then take the union of the individual results
■ If Ci is the cardinality of dimension i, then the cube has size (C1 + 1) · (C2 + 1) · … · (CN + 1)
■ First thought: this is not much bigger than the original data

But…
■ The original data is usually sparse
■ This makes the cube much larger relative to the original data
■ All these group-bys seem so similar…
■ There must be something smarter to do…

How is a group-by computed?
Three basic methods:
■ Nested loops
■ Sorting
■ Hashing
The key idea is to bring identical values together. Then they can be aggregated easily in one scan.

Heuristic
■ If we could exploit the sorting or hashing performed for the computation of one group-by in order to compute another, this would be more efficient!

Group-bys form a lattice
■ Each node represents a group-by operation
■ Each (directed) arc shows a parent-child relationship. A node can be computed from any of its ancestors.

Example
The lattice for three dimensions A, B, C:
ABC
AB  AC  BC
A   B   C
Ø

Five optimizations
■ Smallest-parent
■ Cache-results
■ Amortize-scans
■ Share-sorts
■ Share-partitions

Smallest-parent
■ Compute a node from its smallest ancestor that has already been computed
■ Fewer grouping attributes => more aggregation => fewer tuples
■ "A" can be computed from "ABC", "AB" or "AC". Choose the smallest.

Cache-results
■ Hold in main memory as many computed group-bys as possible
■ This saves I/O cost, which is the bottleneck
■ Ideally, if everything could be stored in main memory, the cube would be computed in one scan
■ But this is not feasible for real data…

Amortize-scans
■ Create an "optimal" plan for the computation of the group-bys (an NP-hard problem)
■ This is equivalent to pruning the lattice into a tree
■ Try to keep as many group-bys as possible in main memory
■ Breadth-first traversal does not seem to be efficient
■ Perhaps depth-first? Or other heuristics…

Share-sorts
■ Share the cost of sorting among the multiple group-bys that need it
■ Applies only to sort-based methods
■ Extensive use of pipelining
■ If you sort in order "ABC", then the result is also sorted in orders "AB" and "A"
■ Find common prefixes or partially matching sort orders

Share-partitions
■ Share the cost of partitioning among the multiple group-bys that need it
■ Applies only to hash-based methods
■ Extensive use of pipelining

But…
■ These five optimizations can be contradictory
■ For example, share-sorts suggests using "AB" for the computation of "A". But what if "AC" is much smaller than "AB"?
■ So a lot of methods have been proposed. Each of them implements some combination of the five ideas.

Algorithms
■ PipeSort
■ Overlap
■ PipeHash
■ PartitionedCube
■ BottomUpCube
■ …

New ideas and relevant topics
■ Multi-way array (value-based vs. position-based algorithms)
■ Lossy methods (e.g. using wavelets) if you don't care about exact accuracy
■ Materialize as much as possible. But which nodes should be selected? And if some are materialized, how should queries be evaluated? (see the sketch below)
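A closing sketch of the last point (again an illustration, not part of the original slides): if one node of the lattice, say revenue per (store, week), has been materialized over the hypothetical sales table used earlier, then a coarser query such as revenue per (city, week) can be answered from it without touching the base data. The sketch assumes a hypothetical dimension table store_dim(store, city) that records the store-to-city hierarchy, and relies on SUM being distributive.

    -- Materialize one node of the lattice: revenue per (store, week)
    CREATE TABLE store_week AS
    SELECT store, week, SUM(revenue) AS revenue
    FROM sales
    GROUP BY store, week;

    -- Roll up the materialized node to answer the coarser (city, week) query
    SELECT d.city, sw.week, SUM(sw.revenue) AS revenue
    FROM store_week sw
    JOIN store_dim d ON d.store = sw.store
    GROUP BY d.city, sw.week;

Deciding which nodes to materialize so that such rewrites pay off is exactly the view-selection question raised in the last slide.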