Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
11/19/2015
CSI 4352, Introduction to Data Mining
Lecture 4, Data Cube Computation
Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University
A Roadmap for Data Cube Computation
Full Cube
Full materialization
Materializing all the cells of all of the cuboids for a given data cube
Issues in time and space
Iceberg Cube
Partial materialization
Materializing the cells of only interesting cuboids
Materializing only the cells in a cuboid whose measure value is above
the minimum threshold
Closed Cube
Materializing only closed cells
1
11/19/2015
Typical Computation Process
Example
all
time
time,item
item
location
time,location
item,location
time,supplier
time,item,location
supplier
location,supplier
item,supplier
time,location,supplier
time,item,supplier
item,location,supplier
time, item, location, supplier
Computation Techniques
Aggregating
Aggregating from the smallest child cuboid
Caching
Caching the result of a cuboid for the computation of other cuboids to
reduce disk I/O
Sorting, Hashing and Grouping
Sorting, hashing, and grouping operations are applied to a dimension
in order to reorder and cluster
Pruning
A priori pruning the cells with lower support than minimum threshold
2
11/19/2015
CSI 4352, Introduction to Data Mining
Lecture 4, Data Cube Computation
Multi-way Array Aggregation
BUC: Iceberg Cube Computation
Star-Cubing
Shell Fragment Cube Computation
Multi-Way Array Aggregation
Features
Array-based “bottom-up” approach
Uses multi-dimensional chunks
No direct tuple comparisons
Simultaneous aggregation
on multiple dimensions
all
A
B
C
Intermediate aggregate values
are re-used for computing ancestor
cuboids
AB
AC
BC
Full materialization (No iceberg
optimization, No Apriori pruning)
ABC
3
11/19/2015
Aggregation Strategy
Partitions array into chunks
Chunk: a small sub-cube which fits in memory
Data addressing
Uses chunk id and offset
C
Multi-way Aggregation
c0
Computes aggregates in
multi-way
B
Visits chunks in the order
• to minimize memory
access
c1
c2
c3 61
62
63
64
45
46
47
48
29
30
31
32
b3
B13
14
15
16
b2
9
10
11
12
b1
5
6
7
8
b0
1
2
3
4
a0
a1
a2
a3
• to minimize memory space
60
44
28 56
40
24 52
36
20
A
Aggregation Process
Example
c0
c1
c2
c3 61
62
63
64
45
46
47
48
29
30
31
32
b3
B13
14
15
16
b2
9
10
11
12
b1
5
6
7
8
b0
1
2
3
4
a0
a1
a2
a3
60
44
28 56
40
24 52
36
20
4
11/19/2015
Aggregation Process
Example - Continued
c0
c1
c2
c3 61
62
63
64
45
46
47
48
29
30
31
32
b3
B13
14
15
16
b2
9
10
11
12
b1
5
6
7
8
b0
1
2
3
4
a0
a1
a2
a3
60
44
28 56
40
24 52
36
20
Optimization of Memory Usage
What is the best traversing order?
Depends on the memory space required
Example
Suppose the data size on
C
each dimension A, B and C
c0
is 40, 400 and 4000,
respectively.
Minimum memory required
when traversing the order,
1,2,3,4,5,…, 64
Total memory required is
100×1000 + 40×1000 +
40×400
B
c1
c2
c3 61
62
63
64
45
46
47
48
29
30
31
32
b3
B13
14
15
16
b2
9
10
11
12
b1
5
6
7
8
b0
1
2
3
4
a0
a1
a2
a3
60
44
28 56
40
24 52
36
20
A
5
11/19/2015
Summary of Multi-Way
Method
Cuboids should be sorted and computed according to the data size
Keeps the smallest plane in the main memory, fetches and computes
on each dimension
only one chunk at a time for the largest plane
Limitations
Full materialization
Computes well only for a small number of dimensions
( high dimensional data → partial materialization )
Reference
Zhao, Y., Deshpande, P.M. and Naughton, J.F., “An Array-Based Algorithm
for Simultaneous Multidimensional Aggregates”, In Proceedings of SIGMOD
(1997)
CSI 4352, Introduction to Data Mining
Lecture 4, Data Cube Computation
Multi-way Array Aggregation
BUC: Iceberg Cube Computation
Star-Cubing
Shell Fragment Cube Computation
6
11/19/2015
Bottom-Up Computation (BUC)
Characteristics
“Top-down” approach
Partial materialization (iceberg cube computation)
Divides dimensions into partitions
1 all
and facilitates iceberg pruning
No simultaneous aggregation
2A
3 AB
4 ABC
10 B
7 AC
14 C
16 D
9 AD 11 BC 13 BD
6 ABD
8 ACD
15 CD
12 BCD
5 ABCD
Iceberg Pruning Process
Partitioning
Sorts data values
Partitions into blocks that fit in memory
Apriori Pruning
For each partition
• If it does not satisfy min_sup, its descendants are pruned
• If it satisfies min_sup,
materialization and a recursive call including the next dimension
7
11/19/2015
Apriori Pruning Example
Description in SQL
Iceberg cube computation
compute cube iceberg_cube as
select A, B, C, D, count(*)
from R
cube by A, B, C, D
having count(*) ≥ min_sup
Example
( , , , )
0-D cuboid
( a1, , , )
1-D cuboid
( a1, b1, , )
2-D cuboid
( a1, b1, c1, )
3-D cuboid
( a1, b1, c1, d1 )
pruning
( a1, b1, c2, )
3-D cuboid
( a1, b2, , )
pruning
Summary of BUC
Method
Computation of sparse data cubes
Limitations
Sensitive to the order of dimensions
→ The most discriminating dimension should be used first
→ Dimensions should be in the order of decreasing cardinality
→ Dimensions should be in the order of increasing maximum number
of duplicates
Reference
Beyer, K. and Ramakrishnan, R., “Bottom-Up Computation of Sparse and
Iceberg CUBEs”, In Proceedings of SIGMOD (1999)
8
11/19/2015
CSI 4352, Introduction to Data Mining
Lecture 4, Data Cube Computation
Multi-way Array Aggregation
BUC: Iceberg Cube Computation
Star-Cubing
Shell Fragment Cube Computation
Characteristics of Star-Cubing
Characteristics
Integrated method of “top-down” and “bottom-up” cube computation
Explores both multidimensional aggregation (as Multi-way) and
Explores shared dimensions
apriori pruning (as BUC)
all
e.g., dimension A is a shared
A/A
dimension of ACD and AD
B/B
C/C
D
e.g., ABD/AB means ABD has
the shared dimension AB
AB/AB
ABC/ABC
AC/AC
ABD/AB
AD/A
BC/BC
ACD/A
BD/B CD
BCD
ABCD
9
11/19/2015
Iceberg Pruning Strategy
Iceberg Pruning in a Shared Dimension
If AB is a shared dimension of ABD, then the cuboid AB is computed
simultaneously with ABD
If the aggregate on a shared dimension does not satisfy the iceberg
condition (min_sup), then all the cells extended from this shared
dimension cannot satisfy the condition either
If we can compute the shared dimensions before the actual cuboid,
we can apply Apriori pruning
→ Pruning while aggregating simultaneously on multiple dimensions
Cuboid Trees
Cuboid Trees
Tree structure to represent cuboids
Base cuboid tree, 3-D cuboid trees, 2-D cuboid trees, …
Each level represents a dimension
Each node represents an attribute
Each node includes the attribute value,
root: 100
aggregate value, descendant(s)
represents a tuple
a1: 30
The path from the root to a leaf
b1: 10
a2: 20
b2: 10
a3: 20
a4: 20
b3: 10
Example
• count(a1,b1,*,*) = 10
• count(a1,b1,c1,*) = 5
c1: 5
c2: 5
d1: 2
d2: 3
10
11/19/2015
Star Nodes
Star Nodes
If the single dimensional aggregate does not satisfy min_sup,
The nodes are replaced by *
Example (min_sup = 2)
no need to consider the node in the iceberg cube computation
A
B
C
D
count
a1
b1
*
*
1
a1
b1
*
*
1
A
B
C
D
count
a1
*
*
*
1
a1
b1
c1
d1
1
a2
*
c3
d4
1
a2
*
c3
d4
1
a1
b1
c4
d3
1
a1
b2
c2
d2
1
a2
b3
c3
d4
1
a2
b4
c3
d4
1
A
B
C
D
count
a1
b1
*
*
2
a1
*
*
*
1
a2
*
c3
d4
2
Star Tree Structure
Star Tree
A cuboid tree that is compressed using star nodes
Star Tree Construction
Uses the compressed table
Keeps the star table for lookup
root:5
of star nodes
a1:3
Star Table
b2
*
b3
*
a2:2
Lossless compression from
the original cuboid tree
b*:1
b1:2
b*:2
c*:1
c*:2
c3:2
d*:1
d*:2
d4:2
b4
c1
*
*
c2
*
d1
...
*
11
11/19/2015
Multi-Way Star-Tree Aggregation
Aggregation
DFS (depth-first-search) from the root of a star tree
Creates star trees for the cuboids on the next level
root:5 1
a1:32
BCD:5 1
a2:2
a1CD/a1:3 2
b*:1 3
c*:1 4
b*:13
b1:2
b*:2
c*:1 4
d*:1 5
c*:14
c*:2
c3:2
d*:1 5
d*:15
d*:2
d4:2
Base−Tree
BCD−Tree
ACD/A−Tree
a1b*D/a1b*:1 3 a1b*c*/a1b*c*:1 4
d*:15
ABD/AB−Tree
ABC/ABC−Tree
Pruning
Prunes if the aggregates do not satisfy min_sup
Prunes if all the nodes in the generated tree are star nodes
Implementation of Star Cubing
Method
Multi-way aggregation & iceberg pruning
all
C/C
A/A
BCD :3
D/D
b*:3
AC/AC
pruning
ABC/ABC
pruning
ABD/AB
pruning
AD/A BC/BC
ACD/A
pruning
BD/B
c*:1
c3:2
c*:2
d*:1
d4:2
d*:2
CD
root:5
BCD
a1:3
ABCD
b1:2
a2:2
b*:1
b1:2
b*:2
c*:1
c*:2
c3:2
d*:1
d*:2
d4:2
12
11/19/2015
Summary of Star Cubing
Limitations
Sensitive to the order of dimensions
→ The order of decreasing cardinality
Reference
Xin, D., Han, J., Li, X. and Wah, B.W., “Star-Cubing: Computing Iceberg
Cubes by Top-Down and Bottom-Up Integration”, In Proceedings of VLDB
(2003)
CSI 4352, Introduction to Data Mining
Lecture 4, Data Cube Computation
Multi-way Array Aggregation
BUC: Iceberg Cube Computation
Star-Cubing
Shell Fragment Cube Computation
13
11/19/2015
Shell Fragment Cube Computation
Motivations
The computation and storage of iceberg cube are still costly for high
Hard to determine an appropriate iceberg threshold
No update in an iceberg cube
dimensional data ( “the curse of dimensionality” )
Features
Reduces a high dimensional cube into a set of lower dimensional cubes
Lossless reduction
Online re-construction of high-dimensional data cube
Fragmentation Strategy
Observation
OLAP occurs only on a small subset of dimensions at a time
Fragmentation
Partitions the set of dimensions into shell fragments
Computes data cubes for each shell fragment
( 20 3-D data cube computation is much better than 1 60-D data cube.)
Retains inverted indices or value-list indices
Semi-Online Computation
Given the pre-computed fragment cubes, dynamically compute cube cells
of the high-dimensional cube online
14
11/19/2015
Inverted Index
Example
TID
A
B
C
D
E
1
a1
b1
c1
d1
e1
2
a1
b2
c1
d2
e1
3
a1
b2
c1
d1
e2
4
a2
b1
c1
d1
e2
5
a2
b1
c1
d1
e3
Value
Size
1 2 3
a2
4 5
2
b1
1 4 5
3
b2
2 3
2
c1
1 2 3 4 5
5
Builds the inverted index
d1
1 3 4 5
4
(1-D inverted index)
d2
2
1
e1
1 2
2
e2
3 4
2
e3
5
1
Divides the 5 dimensions into
2 shell fragments, (A, B, C) and (D, E)
TID List
a1
3
Cube Computation
Process
Generalizes the 1-D inverted indices to multi-dimensional inverted indices
Computes cuboids using the inverted indices
Example
Shell fragment cube ABC contains 7 cuboids, A, B, C, AB, AC, BC, ABC
The cuboid AB is computed using the 2-D inverted index below
Cell
Intersection
(a1,b1)
{1,2,3}∩{1,4,5}
(a1,b2)
TID List
Size
{1}
1
{1,2,3}∩{2,3}
{2,3}
2
(a2,b1)
{4,5}∩{1,4,5}
{4,5}
2
(a2,b2)
{4,5}∩{2,3}
{}
0
15
11/19/2015
Online Query Computation
Process
Given shell fragment cubes
Divides the query into shell fragments
Fetches the corresponding TID list for each fragment from the cubes
Intersects the TID lists from each fragment to construct instantiated
Computes the online cube using the base table with any data cube
Computes the query
base table
computation algorithm
Shell Fragment Cube Process
Dimensions
D Cuboid
EF Cuboid
DE Cuboid
A B C D E F …
ABC
Cube
DEF
Cube
Cell
T-ID List
d1 e1
{1, 3, 8, 9}
d1 e2
{2, 4, 6, 7}
d2 e1
{5, 10}
…
…
16
11/19/2015
Online Query Process
A B C D E F GH I J K L MN …
Instantiated
Base Table
Online
Cube
Summary of Shell Fragment Cube
Advantages
Significant increase of the speed of data cube computation
Significant decrease of the storage usage
Various applications in very high dimensional data
Limitations
Tradeoffs between the preprocessing time to construct a data cube and
Tradeoffs between the storage for a data cube and the memory for
the online computation time for queries
online computation
Reference
Li, X., Han, J. and Gonzalez, H., “High-Dimensional OLAP: A Minimal
Cubing Approach”, In Proceedings of VLDB (2004)
17
11/19/2015
Questions?
Lecture Slides are found on the Course Website,
www.ecs.baylor.edu/faculty/cho/4352
18