11/19/2015
CSI 4352, Introduction to Data Mining
Lecture 4, Data Cube Computation
Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University

A Roadmap for Data Cube Computation
  Full Cube
    Full materialization
    Materializing all the cells of all of the cuboids for a given data cube
    Issues in time and space
  Iceberg Cube
    Partial materialization
    Materializing the cells of only interesting cuboids
    Materializing only the cells in a cuboid whose measure value is above the minimum threshold
  Closed Cube
    Materializing only closed cells

Typical Computation Process
  Example (cuboid lattice over the dimensions time, item, location, supplier)
    0-D: all
    1-D: time; item; location; supplier
    2-D: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
    3-D: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
    4-D: time,item,location,supplier

Computation Techniques
  Aggregating
    Aggregating from the smallest child cuboid
  Caching
    Caching the result of a cuboid for the computation of other cuboids to reduce disk I/O
  Sorting, Hashing and Grouping
    Sorting, hashing, and grouping operations are applied to a dimension in order to reorder and cluster related tuples
  Pruning
    Apriori pruning of the cells with lower support than the minimum threshold

Data Cube Computation
  Multi-way Array Aggregation
  BUC: Iceberg Cube Computation
  Star-Cubing
  Shell Fragment Cube Computation

Multi-Way Array Aggregation
  Features
    Array-based “bottom-up” approach
    Uses multi-dimensional chunks
    No direct tuple comparisons
    Simultaneous aggregation on multiple dimensions
    Intermediate aggregate values are re-used for computing ancestor cuboids
    Full materialization (no iceberg optimization, no Apriori pruning)
  [Figure: cuboid lattice with levels all; A, B, C; AB, AC, BC; ABC]

Aggregation Strategy
  Partitions array into chunks
    Chunk: a small sub-cube which fits in memory
  Data addressing
    Uses chunk id and offset
  Multi-way Aggregation
    Computes aggregates in multi-way
    Visits chunks in the order
      • to minimize memory access
      • to minimize memory space
  [Figure: 3-D array over A (a0 to a3), B (b0 to b3) and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64; chunks 1 to 16 form the c0 plane, with A varying fastest, then B, then C]

Aggregation Process Example
  [Figure: the same 64-chunk array, shown on two consecutive slides]

Optimization of Memory Usage
  What is the best traversing order?
    Depends on the memory space required
  Example
    Suppose the data size on each dimension A, B and C is 40, 400 and 4000, respectively.
    With 4 partitions per dimension, each chunk then measures 10 × 100 × 1000.
    Minimum memory is required when traversing the chunks in the order 1, 2, 3, 4, 5, …, 64.
    Total memory required is 100×1000 + 40×1000 + 40×400 = 156,000
      (one chunk of the BC plane, one row of the AC plane, and the whole AB plane)

Summary of Multi-Way Method
  Cuboids should be sorted and computed according to the data size
  Keeps the smallest plane in the main memory; fetches and computes only one chunk at a time for the largest plane
  Limitations
    Full materialization
    Computes well only for a small number of dimensions
    ( high dimensional data → partial materialization )
  Reference
    Zhao, Y., Deshpande, P.M. and Naughton, J.F., “An Array-Based Algorithm for Simultaneous Multidimensional Aggregates”, In Proceedings of SIGMOD (1997)

Bottom-Up Computation (BUC)
  Characteristics
    “Top-down” approach
    Partial materialization (iceberg cube computation)
    Divides dimensions into partitions and facilitates iceberg pruning
    No simultaneous aggregation
  [Figure: BUC processing tree over dimensions A, B, C, D; cuboids are visited in the order
    1 all, 2 A, 3 AB, 4 ABC, 5 ABCD, 6 ABD, 7 AC, 8 ACD, 9 AD, 10 B, 11 BC, 12 BCD, 13 BD, 14 C, 15 CD, 16 D]

Iceberg Pruning Process
  Partitioning
    Sorts data values
    Partitions into blocks that fit in memory
  Apriori Pruning
    For each partition
      • If it does not satisfy min_sup, its descendants are pruned
      • If it satisfies min_sup, it is materialized and a recursive call is made including the next dimension

Apriori Pruning Example
  Description in SQL
    Iceberg cube computation
      compute cube iceberg_cube as
      select A, B, C, D, count(*)
      from R
      cube by A, B, C, D
      having count(*) >= min_sup
  Example
    ( *,  *,  *,  *  )   0-D cuboid
    ( a1, *,  *,  *  )   1-D cuboid
    ( a1, b1, *,  *  )   2-D cuboid
    ( a1, b1, c1, *  )   3-D cuboid
    ( a1, b1, c1, d1 )   pruning
    ( a1, b1, c2, *  )   3-D cuboid
    ( a1, b2, *,  *  )   pruning

Summary of BUC Method
  Computation of sparse data cubes
  Limitations
    Sensitive to the order of dimensions
      → The most discriminating dimension should be used first
      → Dimensions should be in the order of decreasing cardinality
      → Dimensions should be in the order of increasing maximum number of duplicates
  Reference
    Beyer, K. and Ramakrishnan, R., “Bottom-Up Computation of Sparse and Iceberg CUBEs”, In Proceedings of SIGMOD (1999)

Characteristics of Star-Cubing
  Characteristics
    Integrated method of “top-down” and “bottom-up” cube computation
    Explores both multidimensional aggregation (as Multi-way) and Apriori pruning (as BUC)
    Explores shared dimensions
      e.g., dimension A is a shared dimension of ACD and AD
      e.g., ABD/AB means ABD has the shared dimension AB
  [Figure: cuboid lattice annotated with shared dimensions:
    all; A/A, B/B, C/C, D; AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD; ABC/ABC, ABD/AB, ACD/A, BCD; ABCD]

Iceberg Pruning Strategy
  Iceberg Pruning in a Shared Dimension
    If AB is a shared dimension of ABD, then the cuboid AB is computed simultaneously with ABD
    If the aggregate on a shared dimension does not satisfy the iceberg condition (min_sup), then all the cells extended from this shared dimension cannot satisfy the condition either
    If we can compute the shared dimensions before the actual cuboid, we can apply Apriori pruning
      → Pruning while aggregating simultaneously on multiple dimensions

Cuboid Trees
  Cuboid Trees
    Tree structure to represent cuboids
    Base cuboid tree, 3-D cuboid trees, 2-D cuboid trees, …
    Each level represents a dimension
    Each node represents an attribute
    Each node includes the attribute value, aggregate value, descendant(s)
    The path from the root to a leaf represents a tuple
  [Figure: base cuboid tree]
    root: 100
      a1: 30
        b1: 10
          c1: 5
            d1: 2
            d2: 3
          c2: 5
        b2: 10
        b3: 10
      a2: 20
      a3: 20
      a4: 20
  Example
    • count(a1,b1,*,*) = 10
    • count(a1,b1,c1,*) = 5

Star Nodes
  Star Nodes
    If the single-dimensional aggregate does not satisfy min_sup, there is no need to consider the node in the iceberg cube computation
    The nodes are replaced by *
  Example (min_sup = 2)
    Base table
      A    B    C    D    count
      a1   b1   c1   d1   1
      a1   b1   c4   d3   1
      a1   b2   c2   d2   1
      a2   b3   c3   d4   1
      a2   b4   c3   d4   1
    After replacing star nodes
      A    B    C    D    count
      a1   b1   *    *    1
      a1   b1   *    *    1
      a1   *    *    *    1
      a2   *    c3   d4   1
      a2   *    c3   d4   1
    Compressed table
      A    B    C    D    count
      a1   b1   *    *    2
      a1   *    *    *    1
      a2   *    c3   d4   2

Star Tree Structure
  Star Tree
    A cuboid tree that is compressed using star nodes
  Star Tree Construction
    Uses the compressed table
    Keeps the star table for lookup of star nodes
    Lossless compression from the original cuboid tree
  Star Table
    b2   *
    b3   *
    b4   *
    c1   *
    c2   *
    d1   *
    ...
  [Figure: star tree]
    root: 5
      a1: 3
        b*: 1
          c*: 1
            d*: 1
        b1: 2
          c*: 2
            d*: 2
      a2: 2
        b*: 2
          c3: 2
            d4: 2

Multi-Way Star-Tree Aggregation
  Aggregation
    DFS (depth-first search) from the root of a star tree
    Creates star trees for the cuboids on the next level
  [Figure: the Base-Tree together with the BCD-Tree, ACD/A-Tree, ABD/AB-Tree and ABC/ABC-Tree created during the DFS; nodes are numbered by the step at which they are visited]
  Pruning
    Prunes if the aggregates do not satisfy min_sup
    Prunes if all the nodes in the generated tree are star nodes

Implementation of Star Cubing Method
  Multi-way aggregation & iceberg pruning
  [Figure: cuboid lattice with shared dimensions in which some cuboids are marked as pruned, shown next to the base star tree and the BCD tree]

Summary of Star Cubing
  Limitations
    Sensitive to the order of dimensions
      → The order of decreasing cardinality
  Reference
    Xin, D., Han, J., Li, X. and Wah, B.W., “Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration”, In Proceedings of VLDB (2003)

Shell Fragment Cube Computation
  Motivations
    The computation and storage of an iceberg cube are still costly for high dimensional data ( “the curse of dimensionality” )
    Hard to determine an appropriate iceberg threshold
    No update in an iceberg cube
  Features
    Reduces a high dimensional cube into a set of lower dimensional cubes
    Lossless reduction
    Online re-construction of the high-dimensional data cube

Fragmentation Strategy
  Observation
    OLAP occurs only on a small subset of dimensions at a time
  Fragmentation
    Partitions the set of dimensions into shell fragments
    Computes data cubes for each shell fragment
      ( Computing twenty 3-D data cubes is much cheaper than computing one 60-D data cube. )
    Retains inverted indices or value-list indices
  Semi-Online Computation
    Given the pre-computed fragment cubes, dynamically computes cube cells of the high-dimensional cube online

Inverted Index
  Example
    Base table
      TID   A    B    C    D    E
      1     a1   b1   c1   d1   e1
      2     a1   b2   c1   d2   e1
      3     a1   b2   c1   d1   e2
      4     a2   b1   c1   d1   e2
      5     a2   b1   c1   d1   e3
    Divides the 5 dimensions into 2 shell fragments, (A, B, C) and (D, E)
    Builds the inverted index (1-D inverted index)
      Value   TID List     Size
      a1      1 2 3        3
      a2      4 5          2
      b1      1 4 5        3
      b2      2 3          2
      c1      1 2 3 4 5    5
      d1      1 3 4 5      4
      d2      2            1
      e1      1 2          2
      e2      3 4          2
      e3      5            1

Cube Computation
  Process
    Generalizes the 1-D inverted indices to multi-dimensional inverted indices
    Computes cuboids using the inverted indices
  Example
    Shell fragment cube ABC contains 7 cuboids: A, B, C, AB, AC, BC, ABC
    The cuboid AB is computed using the 2-D inverted index below
      Cell      Intersection       TID List   Size
      (a1,b1)   {1,2,3}∩{1,4,5}    {1}        1
      (a1,b2)   {1,2,3}∩{2,3}      {2,3}      2
      (a2,b1)   {4,5}∩{1,4,5}      {4,5}      2
      (a2,b2)   {4,5}∩{2,3}        {}         0

Online Query Computation
  Process
    Given shell fragment cubes
    Divides the query into shell fragments
    Fetches the corresponding TID list for each fragment from the cubes
    Intersects the TID lists from each fragment to construct the instantiated base table
    Computes the online cube using the base table with any data cube computation algorithm
    Computes the query

Shell Fragment Cube Process
  [Figure: the dimensions A B C D E F … are grouped into shell fragments, giving the ABC Cube and the DEF Cube; the DEF Cube contains cuboids such as D, EF and DE]
  DE Cuboid
    Cell     T-ID List
    d1 e1    {1, 3, 8, 9}
    d1 e2    {2, 4, 6, 7}
    d2 e1    {5, 10}
    …        …

Online Query Process
  [Figure: dimensions A B C D E F G H I J K L M N … → Instantiated Base Table → Online Cube]

Summary of Shell Fragment Cube
  Advantages
    Significant increase of the speed of data cube computation
    Significant decrease of the storage usage
    Various applications in very high dimensional data
  Limitations
    Tradeoffs between the preprocessing time to construct a data cube and the online computation time for queries
    Tradeoffs between the storage for a data cube and the memory for the online computation
  Reference
    Li, X., Han, J. and Gonzalez, H., “High-Dimensional OLAP: A Minimal Cubing Approach”, In Proceedings of VLDB (2004)

Questions?
  Lecture slides are found on the course website, www.ecs.baylor.edu/faculty/cho/4352
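
Worked sketch: shell fragment cubes from inverted indices
  The shell fragment slides (inverted index, cuboid AB, online query) can be traced in a few lines of Python. This is only a minimal sketch, not the implementation from Li, Han and Gonzalez: the base table and the fragments (A, B, C) and (D, E) are copied from the slides, while the function names (build_inverted_index, compute_cuboid, build_fragment_cube, query_tids) and the use of in-memory dicts and sets are assumptions made for illustration. Empty cells are kept so the output mirrors the slide's AB table.

```python
from itertools import combinations, product

# Base table copied from the "Inverted Index" slide: TID -> (A, B, C, D, E).
BASE_TABLE = {
    1: ("a1", "b1", "c1", "d1", "e1"),
    2: ("a1", "b2", "c1", "d2", "e1"),
    3: ("a1", "b2", "c1", "d1", "e2"),
    4: ("a2", "b1", "c1", "d1", "e2"),
    5: ("a2", "b1", "c1", "d1", "e3"),
}
DIMENSIONS = ("A", "B", "C", "D", "E")
FRAGMENTS = (("A", "B", "C"), ("D", "E"))  # shell fragments used on the slide


def build_inverted_index(table):
    """Build the 1-D inverted index: dimension -> {value: set of TIDs}."""
    index = {dim: {} for dim in DIMENSIONS}
    for tid, row in table.items():
        for dim, value in zip(DIMENSIONS, row):
            index[dim].setdefault(value, set()).add(tid)
    return index


def compute_cuboid(index, dims):
    """Compute one cuboid (dims must be non-empty) by intersecting TID lists.

    Returns {cell: TID set}. Cells with an empty intersection are kept here
    (illustrative choice) so the result can be compared with the slide.
    """
    value_lists = [sorted(index[dim]) for dim in dims]
    cuboid = {}
    for cell in product(*value_lists):
        tid_sets = [index[dim][value] for dim, value in zip(dims, cell)]
        cuboid[cell] = set.intersection(*tid_sets)
    return cuboid


def build_fragment_cube(index, fragment):
    """Precompute every cuboid of one shell fragment (7 cuboids for A, B, C)."""
    cube = {}
    for k in range(1, len(fragment) + 1):
        for dims in combinations(fragment, k):
            cube[dims] = compute_cuboid(index, dims)
    return cube


def query_tids(fragment_cubes, conditions):
    """Answer a point query such as {"A": "a1", "D": "d1"}.

    The query is split by shell fragment, one TID list is fetched per
    fragment, and the fetched lists are intersected.
    """
    if not conditions:
        raise ValueError("at least one dimension must be instantiated")
    result = None
    for fragment, cube in fragment_cubes.items():
        dims = tuple(d for d in fragment if d in conditions)
        if not dims:
            continue  # this fragment is not touched by the query
        cell = tuple(conditions[d] for d in dims)
        fetched = cube[dims].get(cell, set())
        result = fetched if result is None else result & fetched
    return result


index = build_inverted_index(BASE_TABLE)
fragment_cubes = {f: build_fragment_cube(index, f) for f in FRAGMENTS}
```

  With these definitions, compute_cuboid(index, ("A", "B")) gives the same TID lists as the 2-D inverted index on the slide, and query_tids(fragment_cubes, {"A": "a1", "D": "d1"}) intersects {1, 2, 3} from the ABC fragment with {1, 3, 4, 5} from the DE fragment, leaving TIDs 1 and 3.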