Download Lecture 4, Data Cube Computation

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia, lookup

Transcript
11/19/2015
CSI 4352, Introduction to Data Mining
Lecture 4, Data Cube Computation
Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University
A Roadmap for Data Cube Computation
 Full Cube

Full materialization

Materializing all the cells of all of the cuboids for a given data cube

Issues in time and space
 Iceberg Cube

Partial materialization

Materializing the cells of only interesting cuboids

Materializing only the cells in a cuboid whose measure value is above
the minimum threshold
 Closed Cube

Materializing only closed cells
1
11/19/2015
Typical Computation Process
 Example
all
time
time,item
item
location
time,location
item,location
time,supplier
time,item,location
supplier
location,supplier
item,supplier
time,location,supplier
time,item,supplier
item,location,supplier
time, item, location, supplier
Computation Techniques
 Aggregating

Aggregating from the smallest child cuboid
 Caching

Caching the result of a cuboid for the computation of other cuboids to
reduce disk I/O
 Sorting, Hashing and Grouping

Sorting, hashing, and grouping operations are applied to a dimension
in order to reorder and cluster
 Pruning

A priori pruning the cells with lower support than minimum threshold
2
11/19/2015
CSI 4352, Introduction to Data Mining
Lecture 4, Data Cube Computation
 Multi-way Array Aggregation
 BUC: Iceberg Cube Computation
 Star-Cubing
 Shell Fragment Cube Computation
Multi-Way Array Aggregation
 Features

Array-based “bottom-up” approach

Uses multi-dimensional chunks

No direct tuple comparisons

Simultaneous aggregation
on multiple dimensions

all
A
B
C
Intermediate aggregate values
are re-used for computing ancestor
cuboids

AB
AC
BC
Full materialization (No iceberg
optimization, No Apriori pruning)
ABC
3
11/19/2015
Aggregation Strategy
 Partitions array into chunks

Chunk: a small sub-cube which fits in memory
 Data addressing

Uses chunk id and offset
C
 Multi-way Aggregation

c0
Computes aggregates in
multi-way

B
Visits chunks in the order
• to minimize memory
access
c1
c2
c3 61
62
63
64
45
46
47
48
29
30
31
32
b3
B13
14
15
16
b2
9
10
11
12
b1
5
6
7
8
b0
1
2
3
4
a0
a1
a2
a3
• to minimize memory space
60
44
28 56
40
24 52
36
20
A
Aggregation Process
 Example
c0
c1
c2
c3 61
62
63
64
45
46
47
48
29
30
31
32
b3
B13
14
15
16
b2
9
10
11
12
b1
5
6
7
8
b0
1
2
3
4
a0
a1
a2
a3
60
44
28 56
40
24 52
36
20
4
11/19/2015
Aggregation Process
 Example - Continued
c0
c1
c2
c3 61
62
63
64
45
46
47
48
29
30
31
32
b3
B13
14
15
16
b2
9
10
11
12
b1
5
6
7
8
b0
1
2
3
4
a0
a1
a2
a3
60
44
28 56
40
24 52
36
20
Optimization of Memory Usage
 What is the best traversing order?

Depends on the memory space required
 Example

Suppose the data size on
C
each dimension A, B and C
c0
is 40, 400 and 4000,
respectively.

Minimum memory required
when traversing the order,
1,2,3,4,5,…, 64

Total memory required is
100×1000 + 40×1000 +
40×400
B
c1
c2
c3 61
62
63
64
45
46
47
48
29
30
31
32
b3
B13
14
15
16
b2
9
10
11
12
b1
5
6
7
8
b0
1
2
3
4
a0
a1
a2
a3
60
44
28 56
40
24 52
36
20
A
5
11/19/2015
Summary of Multi-Way
 Method

Cuboids should be sorted and computed according to the data size

Keeps the smallest plane in the main memory, fetches and computes
on each dimension
only one chunk at a time for the largest plane
 Limitations

Full materialization

Computes well only for a small number of dimensions
( high dimensional data → partial materialization )
Reference

Zhao, Y., Deshpande, P.M. and Naughton, J.F., “An Array-Based Algorithm
for Simultaneous Multidimensional Aggregates”, In Proceedings of SIGMOD
(1997)
CSI 4352, Introduction to Data Mining
Lecture 4, Data Cube Computation
 Multi-way Array Aggregation
 BUC: Iceberg Cube Computation
 Star-Cubing
 Shell Fragment Cube Computation
6
11/19/2015
Bottom-Up Computation (BUC)
 Characteristics

“Top-down” approach

Partial materialization (iceberg cube computation)

Divides dimensions into partitions
1 all
and facilitates iceberg pruning

No simultaneous aggregation
2A
3 AB
4 ABC
10 B
7 AC
14 C
16 D
9 AD 11 BC 13 BD
6 ABD
8 ACD
15 CD
12 BCD
5 ABCD
Iceberg Pruning Process
 Partitioning

Sorts data values

Partitions into blocks that fit in memory
 Apriori Pruning

For each partition
• If it does not satisfy min_sup, its descendants are pruned
• If it satisfies min_sup,
materialization and a recursive call including the next dimension
7
11/19/2015
Apriori Pruning Example
 Description in SQL

Iceberg cube computation
compute cube iceberg_cube as
select A, B, C, D, count(*)
from R
cube by A, B, C, D
having count(*) ≥ min_sup
 Example
( , , ,  )
0-D cuboid
( a1, , ,  )
1-D cuboid
( a1, b1, ,  )
2-D cuboid
( a1, b1, c1,  )
3-D cuboid
( a1, b1, c1, d1 )
pruning
( a1, b1, c2,  )
3-D cuboid
( a1, b2, ,  )
pruning
Summary of BUC
 Method

Computation of sparse data cubes
 Limitations

Sensitive to the order of dimensions
→ The most discriminating dimension should be used first
→ Dimensions should be in the order of decreasing cardinality
→ Dimensions should be in the order of increasing maximum number
of duplicates
Reference

Beyer, K. and Ramakrishnan, R., “Bottom-Up Computation of Sparse and
Iceberg CUBEs”, In Proceedings of SIGMOD (1999)
8
11/19/2015
CSI 4352, Introduction to Data Mining
Lecture 4, Data Cube Computation
 Multi-way Array Aggregation
 BUC: Iceberg Cube Computation
 Star-Cubing
 Shell Fragment Cube Computation
Characteristics of Star-Cubing
 Characteristics

Integrated method of “top-down” and “bottom-up” cube computation

Explores both multidimensional aggregation (as Multi-way) and

Explores shared dimensions
apriori pruning (as BUC)
all
e.g., dimension A is a shared
A/A
dimension of ACD and AD
B/B
C/C
D
e.g., ABD/AB means ABD has
the shared dimension AB
AB/AB
ABC/ABC
AC/AC
ABD/AB
AD/A
BC/BC
ACD/A
BD/B CD
BCD
ABCD
9
11/19/2015
Iceberg Pruning Strategy
 Iceberg Pruning in a Shared Dimension

If AB is a shared dimension of ABD, then the cuboid AB is computed
simultaneously with ABD

If the aggregate on a shared dimension does not satisfy the iceberg
condition (min_sup), then all the cells extended from this shared
dimension cannot satisfy the condition either

If we can compute the shared dimensions before the actual cuboid,
we can apply Apriori pruning
→ Pruning while aggregating simultaneously on multiple dimensions
Cuboid Trees
 Cuboid Trees

Tree structure to represent cuboids

Base cuboid tree, 3-D cuboid trees, 2-D cuboid trees, …

Each level represents a dimension

Each node represents an attribute

Each node includes the attribute value,
root: 100
aggregate value, descendant(s)

represents a tuple

a1: 30
The path from the root to a leaf
b1: 10
a2: 20
b2: 10
a3: 20
a4: 20
b3: 10
Example
• count(a1,b1,*,*) = 10
• count(a1,b1,c1,*) = 5
c1: 5
c2: 5
d1: 2
d2: 3
10
11/19/2015
Star Nodes
 Star Nodes

If the single dimensional aggregate does not satisfy min_sup,

The nodes are replaced by *

Example (min_sup = 2)
no need to consider the node in the iceberg cube computation
A
B
C
D
count
a1
b1
*
*
1
a1
b1
*
*
1
A
B
C
D
count
a1
*
*
*
1
a1
b1
c1
d1
1
a2
*
c3
d4
1
a2
*
c3
d4
1
a1
b1
c4
d3
1
a1
b2
c2
d2
1
a2
b3
c3
d4
1
a2
b4
c3
d4
1
A
B
C
D
count
a1
b1
*
*
2
a1
*
*
*
1
a2
*
c3
d4
2
Star Tree Structure
 Star Tree

A cuboid tree that is compressed using star nodes
 Star Tree Construction

Uses the compressed table

Keeps the star table for lookup
root:5
of star nodes

a1:3
Star Table
b2
*
b3
*
a2:2
Lossless compression from
the original cuboid tree
b*:1
b1:2
b*:2
c*:1
c*:2
c3:2
d*:1
d*:2
d4:2
b4
c1
*
*
c2
*
d1
...
*
11
11/19/2015
Multi-Way Star-Tree Aggregation
 Aggregation

DFS (depth-first-search) from the root of a star tree

Creates star trees for the cuboids on the next level
root:5 1
a1:32
BCD:5 1
a2:2
a1CD/a1:3 2
b*:1 3
c*:1 4
b*:13
b1:2
b*:2
c*:1 4
d*:1 5
c*:14
c*:2
c3:2
d*:1 5
d*:15
d*:2
d4:2
Base−Tree
BCD−Tree
ACD/A−Tree
a1b*D/a1b*:1 3 a1b*c*/a1b*c*:1 4
d*:15
ABD/AB−Tree
ABC/ABC−Tree
 Pruning

Prunes if the aggregates do not satisfy min_sup

Prunes if all the nodes in the generated tree are star nodes
Implementation of Star Cubing
 Method

Multi-way aggregation & iceberg pruning
all
C/C
A/A
BCD :3
D/D
b*:3
AC/AC
pruning
ABC/ABC
pruning
ABD/AB
pruning
AD/A BC/BC
ACD/A
pruning
BD/B
c*:1
c3:2
c*:2
d*:1
d4:2
d*:2
CD
root:5
BCD
a1:3
ABCD
b1:2
a2:2
b*:1
b1:2
b*:2
c*:1
c*:2
c3:2
d*:1
d*:2
d4:2
12
11/19/2015
Summary of Star Cubing
 Limitations

Sensitive to the order of dimensions
→ The order of decreasing cardinality
Reference

Xin, D., Han, J., Li, X. and Wah, B.W., “Star-Cubing: Computing Iceberg
Cubes by Top-Down and Bottom-Up Integration”, In Proceedings of VLDB
(2003)
CSI 4352, Introduction to Data Mining
Lecture 4, Data Cube Computation
 Multi-way Array Aggregation
 BUC: Iceberg Cube Computation
 Star-Cubing
 Shell Fragment Cube Computation
13
11/19/2015
Shell Fragment Cube Computation
 Motivations

The computation and storage of iceberg cube are still costly for high

Hard to determine an appropriate iceberg threshold

No update in an iceberg cube
dimensional data ( “the curse of dimensionality” )
 Features

Reduces a high dimensional cube into a set of lower dimensional cubes

Lossless reduction

Online re-construction of high-dimensional data cube
Fragmentation Strategy
 Observation

OLAP occurs only on a small subset of dimensions at a time
 Fragmentation

Partitions the set of dimensions into shell fragments

Computes data cubes for each shell fragment
( 20 3-D data cube computation is much better than 1 60-D data cube.)

Retains inverted indices or value-list indices
 Semi-Online Computation

Given the pre-computed fragment cubes, dynamically compute cube cells
of the high-dimensional cube online
14
11/19/2015
Inverted Index
 Example
TID
A
B
C
D
E
1
a1
b1
c1
d1
e1
2
a1
b2
c1
d2
e1
3
a1
b2
c1
d1
e2
4
a2
b1
c1
d1
e2
5
a2
b1
c1
d1
e3
Value

Size
1 2 3
a2
4 5
2
b1
1 4 5
3
b2
2 3
2
c1
1 2 3 4 5
5
Builds the inverted index
d1
1 3 4 5
4
(1-D inverted index)
d2
2
1
e1
1 2
2
e2
3 4
2
e3
5
1
Divides the 5 dimensions into
2 shell fragments, (A, B, C) and (D, E)

TID List
a1
3
Cube Computation
 Process

Generalizes the 1-D inverted indices to multi-dimensional inverted indices

Computes cuboids using the inverted indices
 Example

Shell fragment cube ABC contains 7 cuboids, A, B, C, AB, AC, BC, ABC

The cuboid AB is computed using the 2-D inverted index below
Cell
Intersection
(a1,b1)
{1,2,3}∩{1,4,5}
(a1,b2)
TID List
Size
{1}
1
{1,2,3}∩{2,3}
{2,3}
2
(a2,b1)
{4,5}∩{1,4,5}
{4,5}
2
(a2,b2)
{4,5}∩{2,3}
{}
0
15
11/19/2015
Online Query Computation
 Process

Given shell fragment cubes

Divides the query into shell fragments

Fetches the corresponding TID list for each fragment from the cubes

Intersects the TID lists from each fragment to construct instantiated

Computes the online cube using the base table with any data cube

Computes the query
base table
computation algorithm
Shell Fragment Cube Process
Dimensions
D Cuboid
EF Cuboid
DE Cuboid
A B C D E F …
ABC
Cube
DEF
Cube
Cell
T-ID List
d1 e1
{1, 3, 8, 9}
d1 e2
{2, 4, 6, 7}
d2 e1
{5, 10}
…
…
16
11/19/2015
Online Query Process
A B C D E F GH I J K L MN …
Instantiated
Base Table
Online
Cube
Summary of Shell Fragment Cube
 Advantages

Significant increase of the speed of data cube computation

Significant decrease of the storage usage

Various applications in very high dimensional data
 Limitations

Tradeoffs between the preprocessing time to construct a data cube and

Tradeoffs between the storage for a data cube and the memory for
the online computation time for queries
online computation
 Reference

Li, X., Han, J. and Gonzalez, H., “High-Dimensional OLAP: A Minimal
Cubing Approach”, In Proceedings of VLDB (2004)
17
11/19/2015
Questions?
 Lecture Slides are found on the Course Website,
www.ecs.baylor.edu/faculty/cho/4352
18