Download Lecture 4, Data Cube Computation

11/19/2015 CSI 4352, Introduction to Data Mining Lecture 4, Data Cube Computation Young-Rae Cho Associate Professor Department of Computer Science Baylor University A Roadmap for Data Cube Computation  Full Cube  Full materialization  Materializing all the cells of all of the cuboids for a given data cube  Issues in time and space  Iceberg Cube  Partial materialization  Materializing the cells of only interesting cuboids  Materializing only the cells in a cuboid whose measure value is above the minimum threshold  Closed Cube  Materializing only closed cells 1 11/19/2015 Typical Computation Process  Example all time time,item item location time,location item,location time,supplier time,item,location supplier location,supplier item,supplier time,location,supplier time,item,supplier item,location,supplier time, item, location, supplier Computation Techniques  Aggregating  Aggregating from the smallest child cuboid  Caching  Caching the result of a cuboid for the computation of other cuboids to reduce disk I/O  Sorting, Hashing and Grouping  Sorting, hashing, and grouping operations are applied to a dimension in order to reorder and cluster  Pruning  A priori pruning the cells with lower support than minimum threshold 2 11/19/2015 CSI 4352, Introduction to Data Mining Lecture 4, Data Cube Computation  Multi-way Array Aggregation  BUC: Iceberg Cube Computation  Star-Cubing  Shell Fragment Cube Computation Multi-Way Array Aggregation  Features  Array-based “bottom-up” approach  Uses multi-dimensional chunks  No direct tuple comparisons  Simultaneous aggregation on multiple dimensions  all A B C Intermediate aggregate values are re-used for computing ancestor cuboids  AB AC BC Full materialization (No iceberg optimization, No Apriori pruning) ABC 3 11/19/2015 Aggregation Strategy  Partitions array into chunks  Chunk: a small sub-cube which fits in memory  Data addressing  Uses chunk id and offset C  Multi-way Aggregation  c0 Computes aggregates in multi-way  B Visits chunks in the order • to minimize memory access c1 c2 c3 61 62 63 64 45 46 47 48 29 30 31 32 b3 B13 14 15 16 b2 9 10 11 12 b1 5 6 7 8 b0 1 2 3 4 a0 a1 a2 a3 • to minimize memory space 60 44 28 56 40 24 52 36 20 A Aggregation Process  Example c0 c1 c2 c3 61 62 63 64 45 46 47 48 29 30 31 32 b3 B13 14 15 16 b2 9 10 11 12 b1 5 6 7 8 b0 1 2 3 4 a0 a1 a2 a3 60 44 28 56 40 24 52 36 20 4 11/19/2015 Aggregation Process  Example - Continued c0 c1 c2 c3 61 62 63 64 45 46 47 48 29 30 31 32 b3 B13 14 15 16 b2 9 10 11 12 b1 5 6 7 8 b0 1 2 3 4 a0 a1 a2 a3 60 44 28 56 40 24 52 36 20 Optimization of Memory Usage  What is the best traversing order?  Depends on the memory space required  Example  Suppose the data size on C each dimension A, B and C c0 is 40, 400 and 4000, respectively.  Minimum memory required when traversing the order, 1,2,3,4,5,…, 64  Total memory required is 100×1000 + 40×1000 + 40×400 B c1 c2 c3 61 62 63 64 45 46 47 48 29 30 31 32 b3 B13 14 15 16 b2 9 10 11 12 b1 5 6 7 8 b0 1 2 3 4 a0 a1 a2 a3 60 44 28 56 40 24 52 36 20 A 5 11/19/2015 Summary of Multi-Way  Method  Cuboids should be sorted and computed according to the data size  Keeps the smallest plane in the main memory, fetches and computes on each dimension only one chunk at a time for the largest plane  Limitations  Full materialization  Computes well only for a small number of dimensions ( high dimensional data → partial materialization ) Reference  Zhao, Y., Deshpande, P.M. and Naughton, J.F., “An Array-Based Algorithm for Simultaneous Multidimensional Aggregates”, In Proceedings of SIGMOD (1997) CSI 4352, Introduction to Data Mining Lecture 4, Data Cube Computation  Multi-way Array Aggregation  BUC: Iceberg Cube Computation  Star-Cubing  Shell Fragment Cube Computation 6 11/19/2015 Bottom-Up Computation (BUC)  Characteristics  “Top-down” approach  Partial materialization (iceberg cube computation)  Divides dimensions into partitions 1 all and facilitates iceberg pruning  No simultaneous aggregation 2A 3 AB 4 ABC 10 B 7 AC 14 C 16 D 9 AD 11 BC 13 BD 6 ABD 8 ACD 15 CD 12 BCD 5 ABCD Iceberg Pruning Process  Partitioning  Sorts data values  Partitions into blocks that fit in memory  Apriori Pruning  For each partition • If it does not satisfy min_sup, its descendants are pruned • If it satisfies min_sup, materialization and a recursive call including the next dimension 7 11/19/2015 Apriori Pruning Example  Description in SQL  Iceberg cube computation compute cube iceberg_cube as select A, B, C, D, count(*) from R cube by A, B, C, D having count(*) ≥ min_sup  Example ( , , ,  ) 0-D cuboid ( a1, , ,  ) 1-D cuboid ( a1, b1, ,  ) 2-D cuboid ( a1, b1, c1,  ) 3-D cuboid ( a1, b1, c1, d1 ) pruning ( a1, b1, c2,  ) 3-D cuboid ( a1, b2, ,  ) pruning Summary of BUC  Method  Computation of sparse data cubes  Limitations  Sensitive to the order of dimensions → The most discriminating dimension should be used first → Dimensions should be in the order of decreasing cardinality → Dimensions should be in the order of increasing maximum number of duplicates Reference  Beyer, K. and Ramakrishnan, R., “Bottom-Up Computation of Sparse and Iceberg CUBEs”, In Proceedings of SIGMOD (1999) 8 11/19/2015 CSI 4352, Introduction to Data Mining Lecture 4, Data Cube Computation  Multi-way Array Aggregation  BUC: Iceberg Cube Computation  Star-Cubing  Shell Fragment Cube Computation Characteristics of Star-Cubing  Characteristics  Integrated method of “top-down” and “bottom-up” cube computation  Explores both multidimensional aggregation (as Multi-way) and  Explores shared dimensions apriori pruning (as BUC) all e.g., dimension A is a shared A/A dimension of ACD and AD B/B C/C D e.g., ABD/AB means ABD has the shared dimension AB AB/AB ABC/ABC AC/AC ABD/AB AD/A BC/BC ACD/A BD/B CD BCD ABCD 9 11/19/2015 Iceberg Pruning Strategy  Iceberg Pruning in a Shared Dimension  If AB is a shared dimension of ABD, then the cuboid AB is computed simultaneously with ABD  If the aggregate on a shared dimension does not satisfy the iceberg condition (min_sup), then all the cells extended from this shared dimension cannot satisfy the condition either  If we can compute the shared dimensions before the actual cuboid, we can apply Apriori pruning → Pruning while aggregating simultaneously on multiple dimensions Cuboid Trees  Cuboid Trees  Tree structure to represent cuboids  Base cuboid tree, 3-D cuboid trees, 2-D cuboid trees, …  Each level represents a dimension  Each node represents an attribute  Each node includes the attribute value, root: 100 aggregate value, descendant(s)  represents a tuple  a1: 30 The path from the root to a leaf b1: 10 a2: 20 b2: 10 a3: 20 a4: 20 b3: 10 Example • count(a1,b1,*,*) = 10 • count(a1,b1,c1,*) = 5 c1: 5 c2: 5 d1: 2 d2: 3 10 11/19/2015 Star Nodes  Star Nodes  If the single dimensional aggregate does not satisfy min_sup,  The nodes are replaced by *  Example (min_sup = 2) no need to consider the node in the iceberg cube computation A B C D count a1 b1 * * 1 a1 b1 * * 1 A B C D count a1 * * * 1 a1 b1 c1 d1 1 a2 * c3 d4 1 a2 * c3 d4 1 a1 b1 c4 d3 1 a1 b2 c2 d2 1 a2 b3 c3 d4 1 a2 b4 c3 d4 1 A B C D count a1 b1 * * 2 a1 * * * 1 a2 * c3 d4 2 Star Tree Structure  Star Tree  A cuboid tree that is compressed using star nodes  Star Tree Construction  Uses the compressed table  Keeps the star table for lookup root:5 of star nodes  a1:3 Star Table b2 * b3 * a2:2 Lossless compression from the original cuboid tree b*:1 b1:2 b*:2 c*:1 c*:2 c3:2 d*:1 d*:2 d4:2 b4 c1 * * c2 * d1 ... * 11 11/19/2015 Multi-Way Star-Tree Aggregation  Aggregation  DFS (depth-first-search) from the root of a star tree  Creates star trees for the cuboids on the next level root:5 1 a1:32 BCD:5 1 a2:2 a1CD/a1:3 2 b*:1 3 c*:1 4 b*:13 b1:2 b*:2 c*:1 4 d*:1 5 c*:14 c*:2 c3:2 d*:1 5 d*:15 d*:2 d4:2 Base−Tree BCD−Tree ACD/A−Tree a1b*D/a1b*:1 3 a1b*c*/a1b*c*:1 4 d*:15 ABD/AB−Tree ABC/ABC−Tree  Pruning  Prunes if the aggregates do not satisfy min_sup  Prunes if all the nodes in the generated tree are star nodes Implementation of Star Cubing  Method  Multi-way aggregation & iceberg pruning all C/C A/A BCD :3 D/D b*:3 AC/AC pruning ABC/ABC pruning ABD/AB pruning AD/A BC/BC ACD/A pruning BD/B c*:1 c3:2 c*:2 d*:1 d4:2 d*:2 CD root:5 BCD a1:3 ABCD b1:2 a2:2 b*:1 b1:2 b*:2 c*:1 c*:2 c3:2 d*:1 d*:2 d4:2 12 11/19/2015 Summary of Star Cubing  Limitations  Sensitive to the order of dimensions → The order of decreasing cardinality Reference  Xin, D., Han, J., Li, X. and Wah, B.W., “Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration”, In Proceedings of VLDB (2003) CSI 4352, Introduction to Data Mining Lecture 4, Data Cube Computation  Multi-way Array Aggregation  BUC: Iceberg Cube Computation  Star-Cubing  Shell Fragment Cube Computation 13 11/19/2015 Shell Fragment Cube Computation  Motivations  The computation and storage of iceberg cube are still costly for high  Hard to determine an appropriate iceberg threshold  No update in an iceberg cube dimensional data ( “the curse of dimensionality” )  Features  Reduces a high dimensional cube into a set of lower dimensional cubes  Lossless reduction  Online re-construction of high-dimensional data cube Fragmentation Strategy  Observation  OLAP occurs only on a small subset of dimensions at a time  Fragmentation  Partitions the set of dimensions into shell fragments  Computes data cubes for each shell fragment ( 20 3-D data cube computation is much better than 1 60-D data cube.)  Retains inverted indices or value-list indices  Semi-Online Computation  Given the pre-computed fragment cubes, dynamically compute cube cells of the high-dimensional cube online 14 11/19/2015 Inverted Index  Example TID A B C D E 1 a1 b1 c1 d1 e1 2 a1 b2 c1 d2 e1 3 a1 b2 c1 d1 e2 4 a2 b1 c1 d1 e2 5 a2 b1 c1 d1 e3 Value  Size 1 2 3 a2 4 5 2 b1 1 4 5 3 b2 2 3 2 c1 1 2 3 4 5 5 Builds the inverted index d1 1 3 4 5 4 (1-D inverted index) d2 2 1 e1 1 2 2 e2 3 4 2 e3 5 1 Divides the 5 dimensions into 2 shell fragments, (A, B, C) and (D, E)  TID List a1 3 Cube Computation  Process  Generalizes the 1-D inverted indices to multi-dimensional inverted indices  Computes cuboids using the inverted indices  Example  Shell fragment cube ABC contains 7 cuboids, A, B, C, AB, AC, BC, ABC  The cuboid AB is computed using the 2-D inverted index below Cell Intersection (a1,b1) {1,2,3}∩{1,4,5} (a1,b2) TID List Size {1} 1 {1,2,3}∩{2,3} {2,3} 2 (a2,b1) {4,5}∩{1,4,5} {4,5} 2 (a2,b2) {4,5}∩{2,3} {} 0 15 11/19/2015 Online Query Computation  Process  Given shell fragment cubes  Divides the query into shell fragments  Fetches the corresponding TID list for each fragment from the cubes  Intersects the TID lists from each fragment to construct instantiated  Computes the online cube using the base table with any data cube  Computes the query base table computation algorithm Shell Fragment Cube Process Dimensions D Cuboid EF Cuboid DE Cuboid A B C D E F … ABC Cube DEF Cube Cell T-ID List d1 e1 {1, 3, 8, 9} d1 e2 {2, 4, 6, 7} d2 e1 {5, 10} … … 16 11/19/2015 Online Query Process A B C D E F GH I J K L MN … Instantiated Base Table Online Cube Summary of Shell Fragment Cube  Advantages  Significant increase of the speed of data cube computation  Significant decrease of the storage usage  Various applications in very high dimensional data  Limitations  Tradeoffs between the preprocessing time to construct a data cube and  Tradeoffs between the storage for a data cube and the memory for the online computation time for queries online computation  Reference  Li, X., Han, J. and Gonzalez, H., “High-Dimensional OLAP: A Minimal Cubing Approach”, In Proceedings of VLDB (2004) 17 11/19/2015 Questions?  Lecture Slides are found on the Course Website, www.ecs.baylor.edu/faculty/cho/4352 18

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Lecture 4, Data Cube Computation