Proceedings of the 3rd National Conference; INDIACom-2009
Computing For Nation Development, February 26 – 27, 2009
Bharati Vidyapeeth's Institute of Computer Applications and Management, New Delhi
Data Cubing Algorithms - Comparative Study of Data Cubing Algorithms
Beena Mahar
E-Mail: [email protected]
ABSTRACT
A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data organized in support of management decision-making. Several factors distinguish data warehouses from operational databases. Because the two systems provide quite different functionalities and require different kinds of data, it is necessary to maintain data warehouses separately from operational databases.
A data cube is the core of the multidimensional model; it consists of a large set of facts, measures and a number of dimensions. A data cube consists of a lattice of cuboids, each corresponding to a different degree of summarization of the given multidimensional data.
Data cube computation is an essential task in data warehouse implementation. The precomputation of all or part of a data cube can greatly reduce the response time and enhance the performance of on-line analytical processing. There are several methods for cube computation, several strategies for cube materialization, and some specific computation algorithms, namely Multiway array aggregation, BUC, C-Cubing, Star-Cubing, the computation of shell fragments, and the computation of cubes involving complex measures. Full materialization refers to the computation of all of the cuboids in the lattice defining a data cube. It typically requires an excessive amount of storage space, particularly as the number of dimensions and the size of the associated concept hierarchies grow. This problem is known as the curse of dimensionality. Alternatively, partial materialization is the selective computation of a subset of the cuboids or subcubes in the lattice. For example, an iceberg cube is a data cube that stores only those cube cells whose aggregate value (e.g., count) is above some minimum support threshold.
We have selected to do a comparative study of the above-mentioned specific data cubing algorithms as the core topic of our paper. Our paper is organized as follows: first an explanation of data cubes and the various existing cubing algorithms, then a detailed study of each cubing algorithm, and finally the conclusion of our study.
INTRODUCTION
Data Cubes
Data cubing is the process of computing the set of all possible group-bys from a base table. It facilitates many OLAP operations such as drill-down or roll-up. Our package features several algorithms for computing the full data cube (all group-bys), iceberg cubes (group-bys satisfying a minimum support value), and closed cubes (group-bys which are not subsumed by any other group-bys).
Users of decision support systems often see data in the form of data cubes. The cube is used to represent data along some measure of interest. Although called a "cube", it can be 2-dimensional, 3-dimensional, or higher-dimensional. Each dimension represents some attribute in the database and the cells in the data cube represent the measure of interest. For example, they could contain a count of the number of times that attribute combination occurs in the database, or the minimum, maximum, sum or average value of some attribute. Queries are performed on the cube to retrieve decision support information.
Example: We have a database that contains transaction information relating company sales of a part to a customer at a store location. The data cube formed from this database is a 3-dimensional representation, with each cell (p, c, s) of the cube representing a combination of values from part, customer and store-location. A sample data cube for this combination is shown in the figures (Fig a and Fig b). The content of each cell is the count of the number of times that specific combination of values occurs together in the database. Cells that appear blank in fact have a value of zero. The cube can then be used to retrieve information within the database about, for example, which store should be given a certain part to sell in order to make the greatest sales.
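The base cuboid of such a count cube can be sketched as a dictionary keyed by (part, customer, store-location); the values below are made up for illustration and are not those of the paper's figures:

```python
from collections import Counter

# Hypothetical sales transactions: (part, customer, store-location).
transactions = [
    ("p1", "c1", "s1"),
    ("p1", "c1", "s1"),
    ("p1", "c2", "s2"),
    ("p2", "c1", "s2"),
]

# Each cell (p, c, s) of the base cuboid holds the count of that combination.
cube = Counter(transactions)

print(cube[("p1", "c1", "s1")])  # 2
print(cube[("p2", "c2", "s1")])  # 0 -- "blank" cells have a value of zero
```

A Counter returns 0 for absent keys, which matches the observation above that blank cells in the figure stand for a count of zero.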
KEYWORDS
Data Cubes, Data Cube Algorithms, Star-Cubing, MM-Cubing, C-Cubing, Multiway Array Aggregation, BUC
Fig a): Front View of Sample Data Cube
Fig b): Entire View of Sample Data Cube
Computed versus Stored Data Cubes
The goal is to retrieve the decision support information from the data cube in the most efficient way possible. Three possible solutions are:
 Pre-compute all cells in the cube
 Pre-compute no cells
 Pre-compute some of the cells
If the whole cube is pre-computed, then queries run on the cube will be very fast. The disadvantage is that the pre-computed cube requires a lot of memory. The size of a cube for n attributes D1,...,Dn with cardinalities |D1|,...,|Dn| is Π|Di|. This size increases exponentially with the number of attributes and linearly with the cardinalities of those attributes.
To minimize memory requirements, we can pre-compute none of the cells in the cube. The disadvantage here is that queries on the cube will run more slowly because the cube will need to be rebuilt for each query.
As a compromise between these two, we can pre-compute only those cells in the cube which will most likely be used for decision support queries. The trade-off between memory space and computing time is called the space-time trade-off, and it often arises in data mining and computer science in general.
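The size formula Π|Di| is easy to check directly; the cardinalities below are illustrative values, not measurements from the paper:

```python
from math import prod

# Size of a fully pre-computed cube for attributes with these cardinalities:
# |D1| * |D2| * ... * |Dn|.
cardinalities = [10, 10, 10]
print(prod(cardinalities))         # 1000 cells

# Adding one more attribute of cardinality 10 multiplies the size by 10,
# i.e. the size grows exponentially in the number of attributes but only
# linearly in any single attribute's cardinality.
print(prod(cardinalities + [10]))  # 10000 cells
```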
Data Cubing Algorithms
Efficient computation of data cubes has been one of the focal points of research since the introduction of data warehousing, OLAP, and the data cube. Data cubing algorithms mainly fall into 5 categories:
(1) Computation of full or iceberg cubes with simple or complex measures
(2) Approximate computation of compressed data cubes, such as quasi-cubes, wavelet cubes, etc.
(3) Closed cube computation with index structures, such as condensed, dwarf, or quotient cubes
(4) Selective materialization of views
(5) Cube computation on stream data for multi-dimensional regression analysis
Data Cube Computation
 A data cube can be viewed as a lattice of cuboids
 The bottom-most cuboid is the base cuboid
 The top-most cuboid (apex) contains only one cell
 How many cuboids are there in an n-dimensional cube where dimension i has Li levels? The total is T = Π(Li + 1), the product taken over all n dimensions
 Materialization of the data cube: materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
 Selection of which cuboids to materialize: based on size, sharing, access frequency, etc.
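The cuboid-count formula from the bullets above can be sketched as a one-line function (the level counts passed in are illustrative):

```python
from math import prod

def total_cuboids(levels_per_dim):
    # Dimension i with L_i concept-hierarchy levels contributes (L_i + 1)
    # choices (including generalizing the dimension away entirely), so the
    # lattice contains prod(L_i + 1) cuboids in total.
    return prod(L + 1 for L in levels_per_dim)

# Without hierarchies (one level each), an n-dimensional cube has 2^n cuboids:
print(total_cuboids([1, 1, 1]))  # 8
# With hierarchies of 4, 4 and 2 levels:
print(total_cuboids([4, 4, 2]))  # 75
```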
Cube Operation
Cube definition and computation in DMQL:
define cube sales[item, state, year]: sum(sales_in_dollars)
compute cube sales
Transform it into a SQL-like language:
SELECT item, state, year, SUM (amount)
FROM SALES
CUBE BY item, state, year
 This needs to compute the following group-bys: (item, state, year), (item, state), (item, year), (state, year), (item), (state), (year), and ()
Fig: Cube Operation (the lattice of group-bys over item, state and year)
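The eight group-bys of the CUBE BY query can be computed naively by iterating over all subsets of the dimensions; a minimal sketch with made-up rows (not the paper's data):

```python
from collections import Counter
from itertools import combinations

# Illustrative base table: (item, state, year, amount).
sales = [
    ("pen",  "NY", 2008, 5),
    ("pen",  "NY", 2009, 3),
    ("book", "CA", 2008, 7),
]
dims = ("item", "state", "year")

# Compute SUM(amount) for every group-by, i.e. every subset of the dimensions.
cube = {}
for k in range(len(dims) + 1):
    for idx in combinations(range(len(dims)), k):
        agg = Counter()
        for row in sales:
            agg[tuple(row[i] for i in idx)] += row[3]
        cube[tuple(dims[i] for i in idx)] = dict(agg)

print(len(cube))        # 8 group-bys, matching the list above
print(cube[("item",)])  # {('pen',): 8, ('book',): 7}
print(cube[()])         # {(): 15}
```

This brute-force enumeration scans the base table once per group-by; the algorithms surveyed below exist precisely to avoid that redundancy.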
Efficient Computation of Data Cubes
Preliminary cube computation tricks
Computing full/iceberg cubes: 3 methodologies
o Top-Down: Multi-Way array aggregation
o Bottom-Up:
 Bottom-up computation: BUC
 H-cubing technique
o Integrating Top-Down and Bottom-Up:
 Star-Cubing algorithm
High-dimensional OLAP: a minimal cubing approach
Computing alternative kinds of cubes: partial cube, closed cube, approximate cube, etc.
Fig: A 3-D array (dimensions A, B, C) partitioned into numbered chunks for multi-way aggregation. What is the best traversing order to do multi-way aggregation? A good order reduces memory access and storage cost.
Preliminary Tricks
 Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
 Aggregates may be computed from previously computed aggregates, rather than from the base fact table
 Smallest-child: computing a cuboid from the smallest previously computed cuboid
 Cache-results: caching results of a cuboid from which other cuboids are computed, to reduce disk I/Os
 Amortize-scans: computing as many cuboids as possible at the same time to amortize disk reads
 Share-sorts: sharing sorting costs across multiple cuboids when a sort-based method is used
 Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used
1) Multi-Way Array Aggregation
 Array-based "bottom-up" algorithm using multi-dimensional chunks
 Method: the planes should be sorted and computed according to their size in ascending order
 Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane
 Limitation of the method: it computes well only for a small number of dimensions
 If there are a large number of dimensions, "top-down" computation and iceberg cube computation methods can be explored
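The simultaneous aggregation at the heart of the multi-way method can be sketched as follows (a tiny in-memory cuboid with made-up values): one scan over the base cells updates several cuboids at once.

```python
from collections import defaultdict

# A tiny dense base cuboid ABC stored as {(a, b, c): measure} -- illustrative.
base = {(a, b, c): 1 for a in range(2) for b in range(2) for c in range(2)}

# One pass over the base cells updates the AB, AC and BC cuboids at the same
# time, so each base cell is visited only once.
AB, AC, BC = defaultdict(int), defaultdict(int), defaultdict(int)
for (a, b, c), v in base.items():
    AB[(a, b)] += v
    AC[(a, c)] += v
    BC[(b, c)] += v

print(AB[(0, 0)])  # 2 -- aggregated over both values of C
```

The real MOLAP algorithm adds chunking so that this pass works on arrays larger than memory, but the one-visit-per-cell idea is the same.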
Multi-way Array Aggregation for Cube Computation (MOLAP)
 Partition arrays into chunks (a small subcube which fits in memory)
 Compressed sparse array addressing: (chunk_id, offset)
 Compute aggregates in "multiway" fashion, by visiting cube cells in the order which minimizes the number of times each cell must be visited
 No direct tuple comparisons
 Simultaneous aggregation on multiple dimensions
 Intermediate aggregate values are re-used for computing ancestor cuboids
 Cannot do Apriori pruning: no iceberg optimization
2) Bottom-Up Computation (BUC)
 Divides dimensions into partitions and facilitates iceberg pruning
 If a partition does not satisfy min_sup, its descendants can be pruned
 If min_sup = 1, this computes the full CUBE
 No simultaneous aggregation
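The partition-and-recurse structure of BUC, with iceberg (min_sup) pruning, can be sketched as follows; the data and the key encoding are illustrative, not BUC's actual on-disk layout:

```python
def buc(tuples, dims, min_sup, prefix=(), out=None):
    """Bottom-up cube computation with iceberg (min_sup) pruning.

    tuples: the current partition; dims: dimension indices still to expand.
    A partition whose count falls below min_sup is pruned together with all
    of its descendants (Apriori-style pruning).
    """
    if out is None:
        out = {}
    if len(tuples) < min_sup:        # prune this partition and its descendants
        return out
    out[prefix] = len(tuples)        # output the aggregate (count) cell
    for j, d in enumerate(dims):
        # Partition on the distinct values of dimension d, then recurse on
        # the remaining dimensions only (bottom-up expansion).
        parts = {}
        for t in tuples:
            parts.setdefault(t[d], []).append(t)
        for val, part in parts.items():
            buc(part, dims[j + 1:], min_sup, prefix + ((d, val),), out)
    return out

rows = [("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
cube = buc(rows, dims=(0, 1), min_sup=2)
print(cube[((0, "a1"),)])    # 2
print(((0, "a2"),) in cube)  # False -- pruned: its count 1 < min_sup
```

With min_sup = 1 the pruning test never fires and the full cube is produced, matching the bullet above.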
Fig a): BUC — bottom-up expansion over the cuboid lattice, from all through A, B, C, D up to ABCD, partitioning each dimension on its distinct values
3) Star-Cubing: An Integrating Method
 Integrates the top-down and bottom-up methods
 Explores shared dimensions: e.g., dimension A is the shared dimension of ACD and AD; the notation ABD/AB means cuboid ABD has shared dimensions AB
 Allows for shared computations: e.g., cuboid AB is computed simultaneously with ABD
 Aggregates in a top-down manner, but with a bottom-up sub-layer underneath which allows Apriori pruning
 Shared dimensions grow in a bottom-up fashion
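The star-reduction step that gives the star-tree its name can be sketched as follows (illustrative data, not the paper's example): for an iceberg cube with threshold min_sup, a value whose count in its single-dimension cuboid is below min_sup can never appear in a cell that passes the threshold, so it is collapsed into a single star node (*) before the tree is built.

```python
from collections import Counter

min_sup = 2
rows = [("a1", "b1"), ("a1", "b2"), ("a2", "b3")]

# Count value frequencies per dimension; infrequent values become "*".
ndims = len(rows[0])
counts = [Counter(r[d] for r in rows) for d in range(ndims)]
reduced = [
    tuple(v if counts[d][v] >= min_sup else "*" for d, v in enumerate(r))
    for r in rows
]
print(reduced)  # [('a1', '*'), ('a1', '*'), ('*', '*')]
```

Collapsed tuples then share star-tree branches, which is what makes the compressed tree so much smaller than the base table on skewed data.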
Fig: BUC processing tree, with numbers indicating the DFS visit order (1 all, 2 A, 3 AB, 4 ABC, 5 ABCD, …, 16 D)
BUC: Partitioning
 Usually the entire data set can't fit in main memory
 Sort distinct values and partition into blocks that fit
 Continue processing
Optimizations
 Partitioning: external sorting, hashing, counting sort
 Ordering dimensions to encourage pruning: consider cardinality, skew and correlation
 Collapsing duplicates: can't do holistic aggregates anymore!
Fig b): An Integrating Method - the Star-Cubing lattice annotated with shared dimensions (e.g., AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD, ABC/ABC, ABD/AB, ACD/A, BCD, ABCD/all)
Star-Cubing Algorithm—DFS on Lattice Tree
Properties of the Proposed Method
 Partitions the data vertically
 Reduces a high-dimensional cube into a set of lower-dimensional cubes
 Online re-construction of the original high-dimensional space
 Lossless reduction
 Offers tradeoffs between the amount of preprocessing and the speed of online computation
 In the 100-dimension example discussed under compressed cubes, only one cell (a1, a2, …, a100, 10) needs to be stored; it represents all the corresponding aggregate cells
Further Implementation Considerations
 Incremental update:
  Append more TIDs to the inverted list
  Add <tid: measure> entries to the ID_measure table
 Bitmap indexing: may further improve space usage and speed
 Inverted index compression: store TID lists as d-gaps; explore more IR compression methods
 Condensed cube — Adv.: an efficient, fully precomputed cube without compression
 Closed cube
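The inverted-list bookkeeping and d-gap compression mentioned above can be sketched as follows; the TID values are made up for illustration:

```python
# Shell-fragment style inverted lists: for each dimension value, the sorted
# list of tuple IDs (TIDs) in which it appears -- illustrative data.
inverted = {("A", "a1"): [1, 2, 5, 9], ("A", "a2"): [3, 4]}

# Incremental update: append the TIDs of newly arrived tuples.
inverted[("A", "a1")].append(12)

# d-gap compression: store each sorted TID list as its first TID followed by
# the gaps between consecutive TIDs (gaps are small, so they compress well).
def to_dgaps(tids):
    return [tids[0]] + [b - a for a, b in zip(tids, tids[1:])]

def from_dgaps(gaps):
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

gaps = to_dgaps(inverted[("A", "a1")])
print(gaps)              # [1, 1, 3, 4, 3]
print(from_dgaps(gaps))  # [1, 2, 5, 9, 12]
```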
Fig c): Star-Cubing Algorithm—DFS on Lattice Tree. The figure shows a base star-tree (root: 5, with compressed star nodes such as b*, c* and d* alongside frequent values such as b1, c3 and d4) together with the cuboid lattice labeled by shared dimensions (A/A, B/B, C/C, D/D, AB/AB, AC/AC, AD/A, BC/BC, BD/B, CD, ABC/ABC, ABD/AB, ACD/A, BCD, ABCD/all).
4) Compressed Cubes: Condensed or Closed Cubes
 Incremental adding of new dimensions: form a new inverted list and add the new fragments
 The iceberg cube cannot solve all the problems of cube computation. Suppose there are 100 dimensions and only 1 base cell with count = 10. How many aggregate (non-base) cells are there if count >= 10? Every aggregate cell derived from that base cell also has count 10, so none of them is pruned by the iceberg condition; a condensed (or closed) cube is needed instead.
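The arithmetic behind the 100-dimension example can be checked directly: each dimension of an aggregate cell is either kept or generalized to *, giving 2^100 combinations, one of which is the base cell itself.

```python
# Number of aggregate (non-base) cells generated by a single base cell in a
# 100-dimension cube: 2^100 combinations minus the base cell itself.
print(2 ** 100 - 1)  # 1267650600228229401496703205375
```

All of these cells inherit count 10, so an iceberg threshold of 10 prunes none of them, while a closed cube collapses them to a single stored cell.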
CONCLUSION
Cube computation factorizes the lattice space. It is well recognized that data cubing often produces huge outputs. Two popular efforts devoted to this problem are (1) the iceberg cube, where only significant cells are kept, and (2) the closed cube, where a group of cells which preserve roll-up/drill-down semantics are losslessly compressed to one cell. Due to its usability and importance, efficient computation of closed cubes still warrants a thorough study.
MM-Cubing performs well on an extensive set of data. The MM-Cubing algorithm first factorizes the lattice space and then computes (1) the dense subspace by simultaneous aggregation, and (2) the sparse subspaces by recursive calls to itself. For a uniform data distribution, MM-Cubing is almost the same as the better of Star-Cubing and BUC and is significantly better than the worse one. When the data is skewed, MM-Cubing is better than both. Thus MM-Cubing is the only cubing algorithm so far that has uniformly high performance across all the data distributions.
MM-Cubing performs best when major values in different dimensions are correlated (appear in the same tuples), since in this case the dense subspace will be very dense and the simultaneous aggregation is extremely efficient. The experiments are all based on data with dimensional independence. The worst case for MM-Cubing is when major values in one dimension are always correlated with minor values in other dimensions. Although this may sometimes happen in real datasets, it is highly unlikely to hold for all factorizations in the recursive calls. Even in this case, MM-Cubing won't perform much worse than BUC, since the computation time for the dense subspace is small compared to the recursive calls.
BUC employs a bottom-up computation by expanding dimensions. Cuboids with fewer dimensions are parents of those with more dimensions. BUC starts by reading the first dimension and partitioning it based on its distinct values. For each partition, it recursively computes the remaining dimensions. The bottom-up computation order facilitates Apriori-based pruning: the computation along a partition terminates if its count is less than min_sup. BUC is very sensitive to the skew of the data: its performance degrades as the skew of the data increases. For sparse data, BUC is good and Star-Cubing is poor.
Star-Cubing integrates the strengths of both top-down and bottom-up cube computation and explores a few additional optimization techniques. Two optimization techniques are worth noting: (1) shared aggregation, taking advantage of shared dimensions among the current cuboid and its descendant cuboids; and (2) pruning unpromising cells as early as possible during the cube computation, using the anti-monotonic property of the iceberg cube measure.
There are three closed iceberg cubing algorithms: C-Cubing (MM), C-Cubing (Star), and C-Cubing (StarArray), evaluated under variations of cardinality, skew, min_sup, and data dependence. The Star family algorithms perform better when min_sup is low; C-Cubing (MM) is good when min_sup is high. The switching point of min_sup increases with the dependence in the data. High dependence incurs more c-pruning, which benefits the Star algorithms. Comparing C-Cubing (Star) and C-Cubing (StarArray), the former is better if the cardinality is low; otherwise, C-Cubing (StarArray) is better.
BUC, Star-Cubing and MM-Cubing were compared under variations of density, min_sup, cardinality and skewness. For dense data, Star-Cubing is good and BUC is poor. For sparse data, BUC is good and Star-Cubing is poor. Both algorithms are poorer than MM-Cubing when the data is heterogeneous (medium skewed, partly dense, and partly sparse). MM-Cubing performs uniformly well on all the data sets. Although there is no all-around clear-cut winner, in most cases MM-Cubing performs better or substantially better than the others.
FUTURE WORK
As future work, we discuss the related work and possible extensions of the approach. For efficient computation of closed (iceberg) cubes, we have proposed an aggregation-based c-checking approach, C-Cubing. With this approach, we proposed and implemented three algorithms: C-Cubing (MM), C-Cubing (Star) and C-Cubing (StarArray). All three algorithms outperform the previous approach. Among them, we have found that C-Cubing (MM) is good when iceberg pruning dominates the computation, whereas the Star family algorithms perform better when c-pruning is significant. Possible extensions include:
 Incorporating constraints with various cube computations
 Dealing with holistic functions
 Applying different compression techniques to compress the cube
 Supporting incremental and batch updates
REFERENCES
[1] Y. Zhao, P. Deshpande, J. F. Naughton. An Array-Based Algorithm for Simultaneous Multidimensional Aggregates. SIGMOD'97.
[2] D. Xin, J. Han, X. Li, B. W. Wah. Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration. VLDB'03.
[3] K. Beyer and R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg Cubes. SIGMOD'99, 359–370.
[4] Z. Shao et al. MM-Cubing: Computing Iceberg Cubes by Factorizing the Lattice Space. SSDBM'04.
[5] J. Han, J. Pei, G. Dong, and K. Wang. Efficient Computation of Iceberg Cubes with Complex Measures. SIGMOD'01.
[6] D. Xin et al. C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking. Technical Report UIUCDCS-R-2005-2648, Department of Computer Science, UIUC, October 2005.
[7] L. Findlater and H. J. Hamilton. "Iceberg Cube Algorithms: An Empirical Evaluation on Synthetic and Real Data," Intelligent Data Analysis, 7(2), 2003.