OLAP Technology
Studying the CUBE BY Operator
Morfonios Constantinos
July 2002
Introduction
■ Database Systems have been very popular over the last decade
■ Powerful Hardware and Software available at low prices
■ Boring every-day tasks have been automated
■ No more paper => Forms, spreadsheets, user-friendly interfaces
Motivation
■ Much information is now digitized
■ Data is used for every-day tasks: transactions, payments, receipts
■ But what can be “hidden” behind all this information?
■ Can data be useful in other ways?
■ Answer: OLAP, Data Mining and Decision Support!
What is OLAP?
■ OLAP stands for On-Line Analytical Processing
■ It comes in contrast to OLTP (On-Line Transaction Processing)
■ It means that we no longer care about individual transactions
■ What we are looking for is trends, statistics and interesting rules behind our data that can help us in business decisions
And why are RDBMSs not enough?
Since we have invested in …
■ Hardware
■ Software
■ Know-how
■ Employees
And since this is a thoroughly studied and tested area
Because of different needs…
■ Historical vs. current data
■ Subject-oriented vs. application-oriented
■ Complex queries “touching” millions of tuples vs. small S-P-J queries
■ Several scans vs. quick indexing on primary keys
■ GB vs. MB
■ Query throughput vs. transaction throughput
Problems that arise
■ Modeling
■ Optimization
■ Indexing
■ Concurrency
■ Recovery
■ Administration
New ideas
■ Incremental updating (how/when?)
■ Parallelism
■ Use of multiple sources (even from the Web)
■ Cleaning, transformation and integration of data
■ Materialized views
■ Visualization
Data Warehouses
■ The new database technology that supports OLAP processing is the Data Warehouse
■ It supports data mining and decision support applications
■ A new conceptual model fits better => the Multidimensional Model
Multidimensional Model
■ We are interested in numeric values, called “measures” (revenue, number of sold items, cost, …)
■ Measures are defined uniquely by “dimensions” that provide their context (time, store, customer, supplier, product, …). Dimensions have their own attributes and may form hierarchies.
Example
Quantity of product p, bought on date d by customer c. Quantity is a point in the 3-d space.
[Figure: a 3-dimensional space with axes Time, Product and Customer; the quantity for (d, p, c) sits at that point]
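As a minimal illustration in Python (names and values hypothetical), the model can be read as a mapping from dimension coordinates to a measure:

# Each cell of the 3-d space maps dimension coordinates
# (date, product, customer) to a measure (quantity).
cube = {}

def record_sale(date, product, customer, quantity):
    key = (date, product, customer)
    cube[key] = cube.get(key, 0) + quantity

record_sale("2002-07-01", "p", "c", 5)
record_sale("2002-07-01", "p", "c", 3)
print(cube[("2002-07-01", "p", "c")])  # 8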
Why this model?
■ Decision makers and business executives are used to working with spreadsheets.
■ Common operations:
◆ Pivot
◆ Roll-up
◆ Drill-down
◆ Slice-and-Dice
◆ Ranking
And so comes the CUBE!
■ Most operations need multiple aggregations
■ Decision support may need information about revenue per individual customer, per store, per city, per week, per season, per week and season, …
■ All possible combinations of dimensions are needed, and not only those, since aggregates can be computed at different levels of detail.
What is the CUBE?
■ It is the 2^N extension of GROUP BY
■ It computes all possible group-bys
Example
SELECT A, B, C, SUM(D)
FROM X
CUBE BY A, B, C

is equivalent to:

SELECT A, B, C, SUM(D)
FROM X
GROUP BY A, B, C
UNION
SELECT A, B, NULL, SUM(D)
FROM X
GROUP BY A, B
UNION
SELECT A, NULL, C, SUM(D)
FROM X
GROUP BY A, C
UNION
SELECT NULL, B, C, SUM(D)
FROM X
GROUP BY B, C
UNION
…
SELECT NULL, NULL, NULL, SUM(D)
FROM X

(NULL marks the “all values” position in each coarser group-by, so every branch of the union has the same arity.)
Naïve Solution
■ Compute all possible group-bys separately and then take the union of the individual results (a sketch follows below)
■ If C_i is the cardinality of dimension i, then the cube has at most (C_1+1)·(C_2+1)···(C_N+1) cells; e.g. for three dimensions with cardinalities 10, 20 and 30, that is 11·21·31 = 7,161 cells against 10·20·30 = 6,000 possible base cells
■ First thought: this is not much bigger than the original data
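A minimal Python sketch of this naïve solution (row layout and names are illustrative, with SUM as the aggregate): every subset of the grouping attributes gets its own full scan of the base data.

from itertools import combinations

def naive_cube(rows, dims):
    """Compute every group-by of `dims` independently over `rows`.
    Each row is a dict; the measure is stored under the key "m"."""
    result = {}  # grouping attributes -> {group key -> aggregated sum}
    for k in range(len(dims) + 1):
        for attrs in combinations(dims, k):
            groups = {}
            for row in rows:  # one full scan per group-by: 2^N scans total
                key = tuple(row[a] for a in attrs)
                groups[key] = groups.get(key, 0) + row["m"]
            result[attrs] = groups
    return result

rows = [{"A": 1, "B": 1, "C": 2, "m": 10},
        {"A": 1, "B": 2, "C": 2, "m": 5}]
print(naive_cube(rows, ("A", "B", "C"))[("A",)])  # {(1,): 15}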
But…
■ The original data is usually sparse
■ This makes the size difference much greater
■ All these group-bys seem so similar…
■ There must be something smarter to do…
How is the group by computed?
Three basic methods:
■ Nested loops
■ Sorting
■ Hashing
The key idea is to bring identical values together. Then, they can be aggregated easily through one scan (the sort-based variant is sketched below).
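A sketch of the sort-based variant in Python (illustrative names, SUM as the aggregate): sorting makes identical keys adjacent, so one sequential scan aggregates each group.

def sorted_group_by(rows, attrs):
    """Sort-based aggregation: sort on the grouping attributes so
    identical keys become adjacent, then accumulate in a single scan."""
    rows = sorted(rows, key=lambda r: tuple(r[a] for a in attrs))
    out, cur_key, cur_sum = [], None, 0
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key != cur_key:
            if cur_key is not None:
                out.append((cur_key, cur_sum))  # previous group is complete
            cur_key, cur_sum = key, 0
        cur_sum += row["m"]
    if cur_key is not None:
        out.append((cur_key, cur_sum))
    return out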
Heuristic
■ If we could exploit the sorting or hashing performed for the computation of one group-by in order to compute another, this would be more efficient!
Group Bys form a lattice
■ Each node represents a group-by operation
■ Each (directed) arc shows a parent-child relationship. A node can be computed from any of its ancestors.
Example
        ABC
    AB  AC  BC
    A   B   C
        Ø
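One way to build this lattice in Python (a sketch; the node representation is an assumption): each node is the set of its grouping attributes, and its direct parents are the nodes with exactly one extra attribute.

from itertools import combinations

def cube_lattice(dims):
    """Return {node: direct parents}; a node can be computed from any
    ancestor, and its direct parents have exactly one more attribute."""
    nodes = [frozenset(c) for k in range(len(dims) + 1)
             for c in combinations(dims, k)]
    return {n: {p for p in nodes if len(p) == len(n) + 1 and n < p}
            for n in nodes}

lat = cube_lattice(("A", "B", "C"))
print(sorted("".join(sorted(p)) for p in lat[frozenset("A")]))  # ['AB', 'AC']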
Five optimizations
■ Smallest-parent
■ Cache-results
■ Amortize-scans
■ Share-sorts
■ Share-partitions
Smallest-parent
■ Compute a node from its smallest ancestor that has already been computed
■ Fewer grouping attributes => more aggregation => fewer tuples
■ “A” can be computed from “ABC”, “AB” or “AC”. Choose the smallest (see the sketch below).
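A sketch of the choice in Python, assuming estimated group-by sizes are available (the numbers below are made up):

def smallest_parent(node, computed, sizes):
    """Pick the smallest already-computed ancestor to aggregate from."""
    ancestors = [g for g in computed if node < g]  # proper supersets
    return min(ancestors, key=lambda g: sizes[g])

sizes = {frozenset("ABC"): 1000, frozenset("AB"): 400, frozenset("AC"): 90}
print("".join(sorted(smallest_parent(frozenset("A"), set(sizes), sizes))))  # AC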
Cache-results
■ Hold in main memory as many computed group-bys as possible (one possible policy is sketched below)
■ This saves I/O cost, which is the bottleneck
■ Ideally, if everything could be stored in main memory, the cube could be computed through one scan
■ But this is not feasible for real data…
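One possible reading in code (the row budget and the evict-largest policy are assumptions, not from the slides): keep finished group-bys in a bounded in-memory store and reuse them instead of rescanning.

class ResultCache:
    """Hold computed group-bys in memory under a row budget;
    evict the largest result first (one possible policy)."""
    def __init__(self, budget_rows):
        self.budget, self.used, self.store = budget_rows, 0, {}

    def put(self, node, groups):
        self.used += len(groups)
        self.store[node] = groups
        while self.used > self.budget and len(self.store) > 1:
            victim = max(self.store, key=lambda n: len(self.store[n]))
            self.used -= len(self.store.pop(victim))

    def get(self, node):
        return self.store.get(node)  # None => recompute from a parent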
Amortize-scans
■ Create an “optimum” plan for the computation of the group-bys (an NP-hard problem)
■ This is equivalent to pruning the lattice and creating a tree
■ Try to keep as many group-bys as possible in main memory
■ Breadth-first seems not to be efficient
■ Perhaps depth-first?
■ Or other heuristics… (the shared-scan idea itself is sketched below)
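The shared-scan idea as a Python sketch (names illustrative): several child group-bys are updated during one pass over their common parent, instead of one scan per child.

def amortized_children(parent_rows, children):
    """Compute several child group-bys in ONE scan of the parent.
    `children` is a list of attribute tuples, each a subset of the
    parent's grouping attributes."""
    accs = {c: {} for c in children}
    for row in parent_rows:  # single shared scan
        for c in children:
            key = tuple(row[a] for a in c)
            accs[c][key] = accs[c].get(key, 0) + row["m"]
    return accs

rows = [{"A": 1, "B": 1, "C": 2, "m": 10},
        {"A": 1, "B": 2, "C": 2, "m": 5}]
print(amortized_children(rows, [("A", "B"), ("A", "C")]))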
Share-sorts
■ Share the cost of sorting among multiple group-bys that need it
■ Applies only to sort-based methods
■ Extensive use of pipelining
■ If you sort in order “ABC”, then the result is also sorted in orders “AB” and “A”
■ Find common prefixes or partially matching sort orders (see the sketch below)
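A pipelined sketch of prefix sharing in Python (a simplification of what PipeSort-style methods do): after one sort in order ABC, the stream is already sorted for AB and A, so each prefix emits its groups in the same single pass.

def pipe_sort_pass(rows, order):
    """Sort once in `order`; aggregate every prefix of the sort order
    in the same pass, emitting a group as soon as its prefix changes."""
    rows = sorted(rows, key=lambda r: tuple(r[a] for a in order))
    prefixes = [order[:k] for k in range(1, len(order) + 1)]
    cur = {p: (None, 0) for p in prefixes}  # (current key, running sum)
    out = {p: [] for p in prefixes}
    for row in rows:
        for p in prefixes:
            key = tuple(row[a] for a in p)
            prev, s = cur[p]
            if key != prev:
                if prev is not None:
                    out[p].append((prev, s))  # group complete: emit now
                prev, s = key, 0
            cur[p] = (prev, s + row["m"])
    for p in prefixes:  # flush the last open group of each prefix
        prev, s = cur[p]
        if prev is not None:
            out[p].append((prev, s))
    return out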
Share-partitions
■ Share the cost of partitioning among multiple group-bys that need it
■ Applies only to hash-based methods
■ Extensive use of pipelining (sketched below)
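The hash-based analogue, sketched in Python (the partition count and data layout are assumptions): partition once on a common attribute; any group-by containing that attribute can then be computed partition by partition, since no group spans two partitions.

def shared_partitions(rows, attr, n_parts=4):
    """Hash-partition once on `attr`; every group-by that includes
    `attr` can reuse these partitions, each processed in memory."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[attr]) % n_parts].append(row)
    return parts

# e.g. reuse the same partitions of A for both AB and AC:
# for part in shared_partitions(rows, "A"):
#     ... aggregate `part` on ("A", "B") and on ("A", "C") ...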
But…
■ These five optimizations can be contradictory
■ For example, share-sorts implies using “AB” for the computation of “A”. But what if “AC” is much smaller than “AB”?
■ So, many methods have been proposed. Each of them is a version of some combination of the five ideas.
Algorithms
■ PipeSort
■ Overlap
■ PipeHash
■ PartitionedCube
■ BottomUpCube
■ …
New ideas and relevant topics
■ Multi-way array (value-based vs. position-based algorithms)
■ Lossy methods (e.g. using wavelets) if you don’t care about accuracy
■ Materialize as much as possible. But which nodes to select?
■ And if you materialize some, how should queries be evaluated?