Download Parallelizing the Data Cube

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Parallelizing the Data Cube
PhD Oral Defence
Todd Eavis
July 23, 2003
Overview
●
Motivation for Parallel, Relational OLAP
●
Core Algorithms and Methods
●
Primary Systems Contributions
●
Experimental Evaluation and Results
●
Conclusions and Future Work
–
Motivation for Parallel, Relational OLAP
–
Core Algorithms and Methods
–
Primary Systems Contributions
–
Experimental Evaluation and Results
–
Conclusions and Future Work
Why study OLAP and the Data Cube?
●
On-line Analytical Processing: the foundation for a
range of essential business applications
–
●
●
●
●
sales and marketing analysis, planning and
budgeting
4 billion dollar industry by 2005
Data Cube: a core OLAP construct, first proposed
in 1996 by Gray et al [GBLP], that supports
sophisticated multi-dimensional data analysis
Relevance to the Research Community? Results of
Citeseer queries:
–
OLAP: 797 papers
–
Data Cube: 362 papers
Our interest: Data Cube
Generation and Querying
Scale of OLAP Data Warehouses
●
●
●
●
●
●
Average size of production data warehouses
currently 700 GB [survey.com/Olap Report]
Expected to reach 4 TB by 2004
1/3 currently <= 50 GB. In two years, this number
will drop to just 6%
Biggest data warehouses growing by a factor of 20
[Winter Report]
Biggest expected to exceed 100 TB within 2 years
Our Interest: Exploiting Parallel
Algorithms
Fundamental Design Alternatives
●
●
●
MOLAP (Multi-dimensional OLAP)
–
Materialize data cube as a multidimensional array
–
In theory: implicit indexing. In practice:
hybrid schemes for sparse and dense
regions
–
Best for dense, low-dimensional spaces
ROLAP (Relational OLAP)
–
Store data as relational tables
–
Requires an explicit multi-dimensional
index
–
Scales well to higher dimensions and
higher cardinalities
Our Interest: Highly Scalable ROLAP model
–
Motivation for Parallel, Relational OLAP
–
Core Algorithms and Methods
–
Primary Systems Contributions
–
Experimental Evaluation and Results
–
Conclusions and Future Work
Computing the Full Cube in Parallel
●
Small number of previous projects [GC, LHL,
MM, NWY]
–
●
Speedup quite limited
Our approach: Parallel Pipesort [DEHR2, DER3]
–
Model 2d views as a “task graph”
–
Create “Scan Pipelines” [AADG] as Minimum Cost
Spanning Tree using O(dn(m + nlogn)) bipartite
matching (n = nodes, m = edges)
–
Partition task graph into sub-trees with O(p3d +
p2d) augmented k-min-max [BSP] and distribute
sub-trees to p processors
–
Use over-sampling – S * p sub-trees – to improve
load balance. High-low pairing in S – 1 rounds
provides approximation to NP-Complete problem
–
Use performance optimized algorithms for sorting,
scanning, and I/O to generation local views
Computing Partial Cubes in Parallel
●
●
●
Important in practice for environments with higher
dimensions and/or specific visualization needs
Little previous work, only partial solutions
[BR,GC,SAG]
Our approach: Greedy algorithms for “Schedule
Tree” construction [DER2, DER5]
–
Solution consists of algorithms for generating
efficieint “Essential Trees” (red) and algorithms
for adding beneficial non-selected nodes (blue)
–
Greedy method: record “ state” information in Plan
Objects. Incrementally add nodes with maximum
benefit
–
Pre-sorting “candidate” views by estimated size
can reduce run-time from O(n3) to O(n2)
–
O(d*n) heuristic extensions for higher dimensional
space. A “confidence factor” β limits risk
A Parallel Date Cube Query Engine
●
●
●
●
Views must be indexed prior to access
Related work: sequential r-trees for data cube
[RYR] and general purpose parallel r-trees [SL]
Our Approach: Parallel RCUBE [DER1, DER6]
–
Records ordered as per Hilbert Space Filling
Curve
–
P-processor round-robin record striping
–
Construct Partial r-tree indexes on each node,
“packing” page blocks in Hilbert order
Parallel Query Engine
–
Combines indexing and OLAP post-processing
(query transformation, parallel Sample Sort,
record permutation, etc.)
–
Uses surrogate views to support Partial Cubes
–
Supports “linear” dimension hierarchies
The Virtual Data Cube
●
●
Motivation: Hide the complexity of
Data Cube algorithms and
implementation
Requires no knowledge of:
–
Format or extent of indexing
–
Degree of materialization (full or
partial)
–
Representation of hierarchies
–
Physical order of view attributes
–
Degree of parallelism
–
Motivation for Parallel, Relational OLAP
–
Core Algorithms and Methods
–
Primary Systems Contributions
–
Experimental Evaluation and Results
–
Conclusions and Future Work
Systems Overview
●
Full, robust Data Cube “prototype” [DER4]
●
Approximately 20,000 lines of code
–
●
●
●
C/C++, LEDA, MPI, STL
Template-based graph algorithms
Designed for, and evaluated on, contemporary
parallel machines:
–
“Shared nothing” Linux cluster
(Dalhousie)
–
“Shared disk” SunFire multi-processor
(HPCVL)
Supporting systems include:
–
Flex/Bison based data generator
–
Batch query generator
Key Performance Issues
●
Dynamic selection of “best” sorting algorithm
–
●
Minimization of data movement
–
●
Use of horizontal and vertical indirection
New pipeline aggregation algorithm
–
●
Radix sort versus quicksort
“Lazy” aggregation
Streamlined I/O
–
I/O manager
–
Independent I/O and computation threads
Costing model
●
●
Sophisticated cost model, common to
both full and partial cube [DEHR1]
Based upon view size estimator
–
●
●
Probabilistic counting technique
Experimentally supported metrics for:
–
Dynamic Sorting (linear time versus
comparison based)
–
In-memory scanning and data
movement
–
Machine specific “Read” and
“Write” I/O
Dynamically considers impact of
computation versus I/O
A Better Search Strategy
●
●
Standard r-tree search strategy employs
Depth First Search
Our approach: Linear Breadth First
Search
–
Map the search algorithm to the
linearly ordered levels of the packed
index
–
Resolve query with a left-to-right,
top-to-bottom walk of the tree
–
Disk head never moves backwards
–
Resolution consists of a sequence of
clustered scans
–
Degrades gracefully to a sequential
scan of index + sequential scan of
–
Motivation for Parallel, Relational OLAP
–
Core Algorithms and Methods
–
Primary Systems Contributions
–
Experimental Evaluation and Results
–
Conclusions and Future Work
Experimental Evaluation
●
Default test environment includes:
–
16 to 24 processors
–
2 million records/80 MB
–
4 to 14 dimensions
–
Random query batches
–
24-node Linux cluster, 16-node
SunFire MP (disk array)
Full Cube
●
●
Parallel Speedup approaching linear
for all components
Efficiency between 80 and 95%
Partial Cube
Query Processing
Full Cube Evaluation
●
●
●
Shared Disk
●
80 – 90%
efficiency
●
Disk array is
bottleneck
Over sampling
factor
SF = 2
consistently best
●
●
Optimized
pipeline
processing
Order of
magnitude
improvement
Partial Cube Evaluation
●
Using partial cube
algorithms for full
cube
●
●
●
●
All within 6% of
“best” benchmark
Recursive
algorithm with
0.1%
Partial cube of 3
dimensions or less
Reductions over
“naïve” method of
65 – 70%
●
●
●
Tree pruning with
confidence factor (on
14 d)
Can eliminate up to
60% of original tree
Virtually no
reduction in tree
quality
Query Evaluation
●
●
●
Ratio of blocks
retrieved to
required seeks
●
Random query
batches
●
Up to 140:1 for
large, sparse spaces
Record retrieval
imbalance (16
processors)
●
Only 0.3% from
optimal load
balance
●
Overhead of using
surrogate views (1
to 16 processors)
Run time on
materialized views
versus time when
those views were
unavailable
–
Motivation for Parallel, Relational OLAP
–
Core Algorithms and Methods
–
Primary Systems Contributions
–
Experimental Evaluation and Results
–
Conclusions and Future Work
Thesis Conclusions
●
●
●
●
ROLAP a viable alternative to MOLAP in
parallel setting
Partial cubes can be efficiently generated
ROLAP cubes can be efficiently indexed
Virtual cube abstraction can be efficiently
supported
Research Highlights
●
First parallel ROLAP system in the Data Cube literature
●
A balanced approach to data cube research
●
●
–
Algorithm design
–
Systems engineering
–
Extensive performance analysis
Evaluated on contemporary parallel machines
–
Commodity-style shared nothing cluster
–
Shared disk architectures
Integration of three “independent” data cube research projects into
a single cohesive OLAP framework – the Virtual Cube
Future Work
●
Automated partial cube specification
–
●
Extension of “virtual cube”
Parallel Query optimization
–
In addition to range queries or linear hierarchies
–
High volume query environments
●
OLAP visualization
●
New projects are building on the current base
–
Generation of Iceberg Cubes
–
Mining of association rules
Thank You!
References: Our own Virtual
Data Cube Research
References: The Data Cube
Literature
DEHR1 F. Dehne, T. Eavis, S. Hambrusch and A. Rau-Chaplin,
Parallelizing The Data Cube, Parallel and Distributed Databases: An
International Journal, 2001
GBLP J. Gray and A. Bosworth and A. Layman and H. Pirahesh", Data Cube: A
Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and SubTotals", ICDE, 1996
DER1 F. Dehne, Todd Eavis, A. Rau-Chaplin, Distributed Multidimensional ROLAP Indexing for the Data Cube, CCGrid, 2003.
GC S. Goil and A. Choudhary, A Parallel Scalable Infrastructure for OLAP and
Data Mining, IDEAS,1999
LHL H. Lu, X. Huang and Z. Li,, Computing Data Cubes Using Massively
DER2 F. Dehne, T. Eavis and A. Rau-Chaplin, Computing Partial Data
Cubes for Parallel Data Warehousing Applications, Euro PVM-MPI, 2001. Parallel Processors, PCW '97, 1997
DER3 F. Dehne, T. Eavis, and A. Rau-Chaplin, Coarse Grained Parallel On- MM S. Muto and M. Kitsuregawa, A dynamic Load Balancing Strategy for
Parallel Datacube Computation, WDW O,1999
Line Analytical Processing (OLAP) For Data Mining, ICCS, 2001.
DER4 F. Dehne, T. Eavis, and A. Rau-Chaplin, A Cluster Architecture for
Parallel Data Warehousing, CCGrid, 2001.
NWY R. Ng, A. Wagner and Y. Yin, Iceberg-cube Computation with PC Clusters,
SIGMOD, 2001
DEHR2 F. Dehne, S. Hambrusch, T. Eavis, and A. Rau-Chaplin,
Parallelizing The Data Cube, ICDT, 2001.
AADG S. Agarwal and R. Agrawal and P. Deshpande and A. Gupta and J.
Naughton and R. Ramakrishnan and S. Sarawagi, On the Computation of
Multidimensional aggregates, VLDB, 1996
CHER Y. Chen, F.Dehne, Todd Eavis, A. Rau-Chaplin, Parallel ROLAP
BSP R. Becker and S. Schach and Y. Perl, A shifting algorithm for min-max tree
Data Cube Construction on Shared Nothing Multi-Processors, IPDPS, 2003.
partitioning, Journal of the ACM, 1982
BR K. Beyer and R. Ramakrishnan, Bottom-up computation of sparse and Iceberg
DER5 F Dehne, T.Eavis, and A. Rau-Chaplin, Computing Partial Data
CUBEs, SIGMOD,1999
Cubes, Submitted to HICCS, 2003.
RYR N. Roussopoulos and Y. Kotidis and M. Roussopolis, Cubetree:
Organization of the bulk incremental updates on the data cube, SIGMOD, 1997
DER6 F Dehne, T.Eavis, and A. Rau-Chaplin, RCUBE: Parallel MultiDimensional ROLAP Indexing, Submitted to Journal of to Data Mining and
SL B. Schnitzer and S. Leutenegger, Master-client r-trees: a new parallel
Knowledge Discovery., 2003
architecture, SSDM, 1999