Proceedings of the 3rd National Conference; INDIACom-2009 Computing For Nation Development, February 26 - 27, 2009, Bharati Vidyapeeth's Institute of Computer Applications and Management, New Delhi

Data Cubing Algorithms - Comparative Study of Data Cubing Algorithms
Beena Mahar, E-Mail: [email protected]

ABSTRACT
A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data organized in support of management decision-making. Several factors distinguish data warehouses from operational databases. Because the two systems provide quite different functionalities and require different kinds of data, it is necessary to maintain data warehouses separately from operational databases. A data cube is the core of the multidimensional model; it consists of a large set of facts, measures and a number of dimensions. A data cube consists of a lattice of cuboids, each corresponding to a different degree of summarization of the given multidimensional data. Data cube computation is an essential task in data warehouse implementation. The precomputation of all or part of a data cube can greatly reduce the response time and enhance the performance of on-line analytical processing. There are several methods for cube computation, several strategies for cube materialization, and some specific computation algorithms, namely multiway array aggregation, BUC, Star-Cubing, C-Cubing, the computation of shell fragments, and the computation of cubes involving complex measures. Full materialization refers to the computation of all of the cuboids in the lattice defining a data cube. It typically requires an excessive amount of storage space, particularly as the number of dimensions and the size of the associated concept hierarchies grow. This problem is known as the curse of dimensionality. Alternatively, partial materialization is the selective computation of a subset of the cuboids or subcubes in the lattice. For example, an iceberg cube is a data cube that stores only those cube cells whose aggregate value (e.g. count) is above some minimum support threshold. We have selected a comparative study of the above-mentioned data cubing algorithms as the core topic of our paper. The paper is organized as follows: first an explanation of data cubes and the various existing cubing algorithms, then a detailed study of each cubing algorithm, and finally the conclusions of our study.

INTRODUCTION
Data Cubes
Data cubing is the process of computing the set of all possible group-by's from a base table. It facilitates many OLAP operations such as drill-down and roll-up. Cubing packages typically feature several algorithms for computing the full data cube (all group-by's), iceberg cubes (group-by's satisfying a minimum support value), and closed cubes (group-by's which are not subsumed by any other group-by). Users of decision support systems often see data in the form of data cubes. The cube is used to represent data along some measure of interest. Although called a "cube", it can be 2-dimensional, 3-dimensional, or higher-dimensional. Each dimension represents some attribute in the database, and the cells in the data cube hold the measure of interest; for example, they could contain a count of the number of times that attribute combination occurs in the database, or the minimum, maximum, sum or average value of some attribute. Queries are performed on the cube to retrieve decision support information.
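To make the notion of computing all possible group-by's concrete, the following is a minimal Python sketch of our own (not any of the cited algorithms) that enumerates every cuboid of a small base table and computes the COUNT measure per cell; the table contents and dimension names are purely illustrative.

```python
from itertools import combinations
from collections import Counter

# Illustrative base table of (part, customer, store_location) transactions.
base_table = [
    ("p1", "c1", "s1"),
    ("p1", "c1", "s2"),
    ("p2", "c1", "s1"),
    ("p2", "c2", "s2"),
]
dimensions = ("part", "customer", "store")

def full_cube(rows, dims):
    """Compute the COUNT measure for every group-by (cuboid) of dims."""
    cube = {}
    for k in range(len(dims) + 1):                    # 0-D apex cuboid .. n-D base cuboid
        for kept in combinations(range(len(dims)), k):
            counts = Counter(tuple(row[i] for i in kept) for row in rows)
            cube[tuple(dims[i] for i in kept)] = dict(counts)
    return cube

# A 3-dimensional base table yields 2^3 = 8 cuboids.
for cuboid, cells in full_cube(base_table, dimensions).items():
    print(cuboid or ("ALL",), cells)
```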
Example: We have a database that contains transaction information relating company sales of a part to a customer at a store location. The data cube formed from this database is a 3-dimensional representation, with each cell (p, c, s) of the cube representing a combination of values from part, customer and store location. A sample data cube for this combination is shown in Figures a and b. The content of each cell is the count of the number of times that specific combination of values occurs together in the database. Cells that appear blank in fact have a value of zero. The cube can then be used to retrieve information from the database about, for example, which store should be given a certain part to sell in order to make the greatest sales.

KEYWORDS
Data Cubes, Data Cube Algorithms, Star-Cubing, MM-Cubing, C-Cubing, Multiway Array Aggregation, BUC

Fig a): Front View of Sample Data Cube
Fig b): Entire View of Sample Data Cube

Computed versus Stored Data Cubes
The goal is to retrieve the decision support information from the data cube in the most efficient way possible. Three possible solutions are: pre-compute all cells in the cube, pre-compute no cells, or pre-compute some of the cells. If the whole cube is pre-computed, then queries run on the cube will be very fast. The disadvantage is that the pre-computed cube requires a lot of memory. The size of a cube for n attributes D1, ..., Dn with cardinalities |D1|, ..., |Dn| is ∏|Di|. This size increases exponentially with the number of attributes and linearly with the cardinalities of those attributes. To minimize memory requirements, we can pre-compute none of the cells in the cube. The disadvantage here is that queries on the cube will run more slowly, because the cube will need to be rebuilt for each query. As a compromise between these two, we can pre-compute only those cells in the cube which are most likely to be used for decision support queries. The trade-off between memory space and computing time is called the space-time trade-off, and it arises frequently in data mining and in computer science in general.

Data Cubing Algorithms
Efficient computation of data cubes has been one of the focal points of research since the introduction of data warehousing, OLAP, and the data cube. Data cubing algorithms mainly fall into five categories:
(1) Computation of full or iceberg cubes with simple or complex measures
(2) Approximate computation of compressed data cubes, such as quasi-cubes, wavelet cubes, etc.
(3) Closed cube computation with index structures, such as condensed, dwarf, or quotient cubes
(4) Selective materialization of views
(5) Cube computation over stream data for multi-dimensional regression analysis

Data Cube Computation
A data cube can be viewed as a lattice of cuboids. The bottom-most cuboid is the base cuboid; the top-most (apex) cuboid contains only one cell. How many cuboids are there in an n-dimensional cube where dimension i has Li hierarchy levels? In total ∏(Li + 1), since each dimension can be aggregated at any of its Li levels or generalized to the virtual level "all".
Materialization of the data cube: materialize every cuboid (full materialization), none (no materialization), or some (partial materialization). Selection of which cuboids to materialize is based on size, sharing, access frequency, etc.
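As a quick illustration of these two quantities, and of why the space-time trade-off matters, the following sketch computes ∏|Di| and ∏(Li + 1) for assumed cardinalities and hierarchy depths; the numbers are invented for the example.

```python
from math import prod

# Assumed base-level cardinalities |Di| for three dimensions (illustrative values only).
cardinalities = [1000, 100, 50]        # e.g. part, customer, store_location
# Assumed number of hierarchy levels Li per dimension (excluding the virtual "all" level).
levels = [3, 2, 2]

base_cuboid_cells = prod(cardinalities)        # cells in the dense n-dimensional array, prod(|Di|)
num_cuboids = prod(l + 1 for l in levels)      # cuboids in the lattice, prod(Li + 1)

print(f"cells in the base cuboid : {base_cuboid_cells:,}")   # 5,000,000
print(f"cuboids in the lattice   : {num_cuboids}")           # 36
```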
Cube Operation
Cube definition and computation in DMQL:
define cube sales[item, state, year]: sum(sales_in_dollars)
compute cube sales
Transformed into a SQL-like language:
SELECT item, state, year, SUM(amount) FROM SALES CUBE BY item, state, year
This requires computing the following group-bys: (item, state, year), (item, state), (item, year), (state, year), (item), (state), (year) and ().
Fig: Cube Operation - the lattice of group-bys over item, state and year.

Efficient Computation of Data Cubes
Preliminary cube computation tricks. Computing full/iceberg cubes: three methodologies:
o Top-down: multi-way array aggregation
o Bottom-up: bottom-up computation (BUC), H-Cubing technique
o Integrating top-down and bottom-up: the Star-Cubing algorithm
High-dimensional OLAP: a minimal cubing approach. Computing alternative kinds of cubes: partial cube, closed cube, approximate cube, etc.

Preliminary Tricks
Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples. Aggregates may be computed from previously computed aggregates rather than from the base fact table.
Smallest-child: computing a cuboid from the smallest previously computed cuboid.
Cache-results: caching the results of a cuboid from which other cuboids are computed, to reduce disk I/Os.
Amortize-scans: computing as many cuboids as possible at the same time, to amortize disk reads.
Share-sorts: sharing sorting costs across multiple cuboids when a sort-based method is used.
Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used.

1) Multi-Way Array Aggregation
An array-based "bottom-up" algorithm using multi-dimensional chunks (MOLAP). Partition the arrays into chunks (a small subcube which fits in memory) and use compressed sparse array addressing: (chunk_id, offset). Compute aggregates in "multiway" fashion by visiting cube cells in the order which minimizes the number of times each cell is visited, thereby reducing memory access and storage cost; the key question is which traversal order is best for the multi-way aggregation.
Fig: a 3-D array for dimensions A, B and C, partitioned into chunks.
The method performs simultaneous aggregation on multiple dimensions with no direct tuple comparisons, and intermediate aggregate values are re-used for computing ancestor cuboids. It cannot do Apriori pruning, so there is no iceberg optimization.
Method: the planes should be sorted and computed according to their size in ascending order. Idea: keep the smallest plane in main memory, and fetch and compute only one chunk at a time for the largest plane. Limitation: the method computes well only for a small number of dimensions. If there are a large number of dimensions, "top-down" computation and iceberg cube computation methods can be explored.
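To make the chunk-at-a-time idea concrete, here is a small sketch of our own (array sizes, chunk size, and the use of a dense NumPy array standing in for the MOLAP array are assumptions) that scans each chunk exactly once and simultaneously accumulates the three 2-D plane aggregates.

```python
import numpy as np

# Illustrative 3-D array for dimensions A, B, C; values are the COUNT measure per base cell.
A, B, C = 8, 8, 8
chunk = 4                                   # each chunk is a 4x4x4 subcube that "fits in memory"
cube = np.random.randint(0, 5, size=(A, B, C))

# Accumulators for the three 2-D cuboids, aggregated simultaneously.
AB = np.zeros((A, B), dtype=cube.dtype)
AC = np.zeros((A, C), dtype=cube.dtype)
BC = np.zeros((B, C), dtype=cube.dtype)

# Visit each chunk exactly once; every base cell contributes to AB, AC and BC
# during that single visit (simultaneous aggregation, no second scan of the data).
for a0 in range(0, A, chunk):
    for b0 in range(0, B, chunk):
        for c0 in range(0, C, chunk):
            block = cube[a0:a0+chunk, b0:b0+chunk, c0:c0+chunk]
            AB[a0:a0+chunk, b0:b0+chunk] += block.sum(axis=2)
            AC[a0:a0+chunk, c0:c0+chunk] += block.sum(axis=1)
            BC[b0:b0+chunk, c0:c0+chunk] += block.sum(axis=0)

# Higher-level cuboids are then derived from the smallest already-computed plane,
# e.g. the 1-D cuboid A from AB, and the apex from A.
A_only = AB.sum(axis=1)
apex = A_only.sum()
print(A_only, apex)
```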
2) Bottom-Up Computation (BUC)
BUC divides the dimensions into partitions and facilitates iceberg pruning: if a partition does not satisfy min_sup, its descendants can be pruned; if min_sup = 1, the full cube is computed. There is no simultaneous aggregation.
Fig a): BUC - processing order over the cuboid lattice (all, A, AB, ABC, ...).
BUC partitioning: usually the entire data set cannot fit in main memory, so the distinct values are sorted, partitioned into blocks that fit, and processing continues block by block. Optimizations include partitioning (external sorting, hashing, counting sort), ordering the dimensions to encourage pruning (by cardinality, skew, and correlation), and collapsing duplicates; with collapsing, however, holistic aggregates can no longer be computed.

3) Star-Cubing: An Integrating Method
Star-Cubing integrates the top-down and bottom-up methods and explores shared dimensions: e.g., dimension A is the shared dimension of ACD and AD, and ABD/AB means cuboid ABD has shared dimensions AB. This allows shared computations, e.g., cuboid AB is computed simultaneously with ABD. The algorithm aggregates in a top-down manner, but with a bottom-up sub-layer underneath that allows Apriori pruning; the shared dimensions grow in a bottom-up fashion.
Fig b): An Integrating Method - the Star-Cubing cuboid tree with shared dimensions (e.g., ACD/A, ABD/AB, ABCD/all).
Star-Cubing performs a DFS on the lattice tree.
Fig c): Star-Cubing Algorithm - DFS on the lattice tree.

High-Dimensional OLAP: A Minimal Cubing Approach (Shell Fragments)
Properties of the method: it partitions the data vertically, reducing a high-dimensional cube into a set of lower-dimensional cubes, with online re-construction of the original high-dimensional space. The reduction is lossless and offers trade-offs between the amount of preprocessing and the speed of online computation.
Further implementation considerations: incremental update (append more TIDs to the inverted list, and add <tid: measure> entries to the ID_measure table); incremental addition of new dimensions (form a new inverted list and add new fragments of the cube); bitmap indexing, which may further improve space usage and speed; and inverted index compression (store as d-gaps, and explore further IR compression methods).

4) Compressed Cubes: Condensed or Closed Cubes
The iceberg cube cannot solve all the problems. Suppose there are 100 dimensions and only 1 base cell with count = 10: how many aggregate (non-base) cells have count >= 10? There are 2^100 - 1 of them, since each of the 100 dimension values can independently be generalized to "*". A condensed (or closed) cube only needs to store one cell, (a1, a2, ..., a100, 10), which represents all of the corresponding aggregate cells; the advantages are an efficient condensed cube and a fully precomputed cube without compression. C-Cubing computes such closed cubes by aggregation-based checking.
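As a small illustration of what "closed" means (a cell is closed when no more specific cell has the same measure value), here is a brute-force sketch of our own, not the C-Cubing algorithm, that computes all aggregate cells of a toy table and keeps only the closed ones.

```python
from itertools import combinations
from collections import Counter

# Tiny illustrative base table over three dimensions.
rows = [("a1", "b1", "c1"),
        ("a1", "b1", "c2"),
        ("a2", "b1", "c1")]
n = 3

def all_cells(rows, n):
    """All aggregate cells, written with '*' for aggregated dimensions, mapped to COUNT."""
    cells = Counter()
    for row in rows:
        for k in range(n + 1):
            for kept in combinations(range(n), k):
                cell = tuple(row[i] if i in kept else "*" for i in range(n))
                cells[cell] += 1
    return cells

def is_descendant(d, c):
    """d specializes c: it agrees wherever c is concrete and is strictly more concrete."""
    return d != c and all(cv == "*" or cv == dv for cv, dv in zip(c, d))

cells = all_cells(rows, n)
# A cell is closed if no strict descendant has the same count.
closed = {c: cnt for c, cnt in cells.items()
          if not any(is_descendant(d, c) and cells[d] == cnt for d in cells)}

print(f"{len(cells)} aggregate cells, {len(closed)} closed cells")
for cell, cnt in sorted(closed.items()):
    print(cell, cnt)
```

In this toy example the apex cell (*, *, *) is not closed, because its descendant (*, b1, *) has the same count; dropping such cells loses no information for roll-up/drill-down.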
CONCLUSION
It is well recognized that data cubing often produces huge outputs. Two popular efforts devoted to this problem are (1) the iceberg cube, where only significant cells are kept, and (2) the closed cube, where a group of cells which preserve roll-up/drill-down semantics is losslessly compressed into one cell. Due to its usability and importance, efficient computation of closed cubes still warrants thorough study.
MM-Cubing performs well on an extensive set of data. The MM-Cubing algorithm first factorizes the lattice space and then computes (1) the dense subspace by simultaneous aggregation, and (2) the sparse subspaces by recursive calls to itself. For a uniform data distribution, MM-Cubing is almost the same as the better of Star-Cubing and BUC and is significantly better than the worse one. When the data is skewed, MM-Cubing is better than both. Thus MM-Cubing is the only cubing algorithm so far that has uniformly high performance across all the data distributions. MM-Cubing performs best when major values in different dimensions are correlated (appear in the same tuples), since in this case the dense subspace is very dense and the simultaneous aggregation is extremely efficient. The experiments are all based on data with dimensional independence. The worst case for MM-Cubing is when major values in one dimension are always correlated with minor values in other dimensions. Although this may sometimes happen in real datasets, it is highly unlikely to hold for all factorizations in the recursive calls. Even in this case, MM-Cubing will not perform much worse than BUC, since the computation time for the dense subspace is small compared to the recursive calls.
BUC employs a bottom-up computation by expanding dimensions. Cuboids with fewer dimensions are parents of those with more dimensions. BUC starts by reading the first dimension and partitioning it based on its distinct values. For each partition, it recursively computes the remaining dimensions. The bottom-up computation order facilitates Apriori-based pruning: the computation along a partition terminates if its count is less than min_sup. BUC is very sensitive to the skew of the data; its performance degrades as the skew increases.
Star-Cubing integrates the strengths of both top-down and bottom-up cube computation and explores a few additional optimization techniques. Two optimization techniques are worth noting: (1) shared aggregation, by taking advantage of shared dimensions between the current cuboid and its descendant cuboids; and (2) pruning unpromising cells as early as possible during the cube computation, using the anti-monotonic property of the iceberg cube measure.
The three closed iceberg cubing algorithms, C-Cubing(MM), C-Cubing(Star), and C-Cubing(StarArray), were evaluated with variations of cardinality, skew, min_sup, and data dependence. The Star family of algorithms performs better when min_sup is low; C-Cubing(MM) is good when min_sup is high. The switching point of min_sup increases with the dependence in the data: high dependence incurs more c-pruning, which benefits the Star algorithms. Comparing C-Cubing(Star) and C-Cubing(StarArray), the former is better if the cardinality is low; otherwise, C-Cubing(StarArray) is better.
BUC, Star-Cubing and MM-Cubing were compared with variations of density, min_sup, cardinality and skew. For dense data, Star-Cubing is good and BUC is poor. For sparse data, BUC is good and Star-Cubing is poor. Both algorithms are poorer than MM-Cubing when the data is heterogeneous (medium skew, partly dense, and partly sparse). MM-Cubing performs uniformly well on all the data sets. Although there is no all-around clear-cut winner, in most cases MM-Cubing performs better, and often substantially better, than the others.
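To make the bottom-up order and the min_sup pruning just described concrete, here is a much-simplified sketch of the recursive partition-and-prune idea; it is our own illustration (no counting sort, no cardinality-based dimension ordering), not the optimized algorithm of the BUC paper, and the table and min_sup value are invented.

```python
from collections import defaultdict

def buc(tuples, dims, prefix, min_sup, out):
    """Bottom-up cube computation with iceberg (min_sup) pruning.

    tuples : current partition of the base table (list of tuples)
    dims   : indices of the dimensions not yet expanded
    prefix : dict {dim_index: value} describing the current group-by cell
    """
    if len(tuples) < min_sup:          # Apriori pruning: no descendant can reach min_sup
        return
    out[tuple(sorted(prefix.items()))] = len(tuples)    # emit this cell
    for pos, d in enumerate(dims):
        # Partition the current data on dimension d, one distinct value at a time.
        parts = defaultdict(list)
        for t in tuples:
            parts[t[d]].append(t)
        for value, part in parts.items():
            # Recurse only on the dimensions after d, so each cell is produced once.
            buc(part, dims[pos + 1:], {**prefix, d: value}, min_sup, out)

# Illustrative base table and threshold (assumed values).
base = [("p1", "c1", "s1"), ("p1", "c1", "s2"), ("p2", "c1", "s1"), ("p2", "c2", "s2")]
cells = {}
buc(base, dims=(0, 1, 2), prefix={}, min_sup=2, out=cells)
for cell, count in sorted(cells.items()):
    print(cell or "ALL", count)
```

With min_sup = 2, partitions such as (p1, c1, s1) are never expanded, illustrating how the bottom-up order lets the iceberg condition prune whole subtrees of the computation.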
FUTURE WORK
As future work, we discuss related work and possible extensions of the approach. For efficient computation of closed (iceberg) cubes, an aggregation-based c-checking approach, C-Cubing, has been proposed and implemented in three algorithms: C-Cubing(MM), C-Cubing(Star) and C-Cubing(StarArray). All three algorithms outperform the previous approach. Among them, C-Cubing(MM) is good when iceberg pruning dominates the computation, whereas the Star family of algorithms performs better when c-pruning is significant. Directions for further work include incorporating constraints into the various cube computations, dealing with holistic functions, applying different compression techniques to compress the cube, and supporting incremental and batch updates.

REFERENCES
[1] Y. Zhao, P. Deshpande, and J. F. Naughton. An Array-Based Algorithm for Simultaneous Multidimensional Aggregates. SIGMOD'97.
[2] D. Xin, J. Han, X. Li, and B. W. Wah. Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration. VLDB'03.
[3] K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg Cubes. SIGMOD'99, 359-370.
[4] Z. Shao et al. MM-Cubing: Computing Iceberg Cubes by Factorizing the Lattice Space. SSDBM'04.
[5] J. Han, J. Pei, G. Dong, and K. Wang. Efficient Computation of Iceberg Cubes with Complex Measures. SIGMOD'01.
[6] D. Xin et al. C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking. Technical Report UIUCDCS-R-2005-2648, Department of Computer Science, UIUC, October 2005.
[7] L. Findlater and H. J. Hamilton. Iceberg Cube Algorithms: An Empirical Evaluation on Synthetic and Real Data. Intelligent Data Analysis, 7(2), 2003.