Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Parallel Data Cube F ord B y M a ke & Y e ar C he vy B y Y ear 1 99 0 1 99 1 1 99 2 1 99 3 Data Mining B y M ak e Red OLAP (On-line analytical processing) W hite B lu e B y C o lo ur & Y e ar cube / group-by operator in SQL B y M a ke & C olou r B y C olou r Frank Dehne www.dehne.net Data Warehousing for Decision Support Query Reports Analysis Data Mining Front-End Tools Olap Server Output Olap Server Olap Engines Monitoring Administration Data Warehouse Data Storage Meta Data Repository Extract Clean Transform Load Refresh Operational Databases Frank Dehne Data Marts External Sources Data Cleaning and Integration • Operational data collected into DW • DW used to support multi-dimensional views • Views form the basis of OLAP processing • Our focus: the OLAP server www.dehne.net Multi-dimensional views • Collection of feature attributes • Aggregate along one or more measure attributes • Reduce the granularity by collapsing dimensions F ord B y M a ke & Y e ar C he vy B y Y ear 1 99 0 1 99 1 1 99 2 1 99 3 B y M ak e Red W hite B lu e B y C o lo ur & Y e ar B y M a ke & C olou r B y C olou r Frank Dehne www.dehne.net Data Cube Generation • Proposed by Gray et al (Microsoft) in 1995 • Exploits the relationship between cuboids to compute all 2d cuboids • In OLAP views are typically pre-computed to improve query response time Frank Dehne ABCD ABC AC AB A ABD ACD AD BC B C BCD BD CD D All www.dehne.net Sequential Solutions • Top Down Cube – Compute high dimension views first – Exploit shared dimensions – Pipesort – PipeHash Frank Dehne • Bottom Up Cube – Minimizes external memory sorting by partitioning first on single attributes • ArrayCube www.dehne.net Sequential Solutions MOLAP • array representation • easy to build and query • large storage • needs translation from/to relational model Frank Dehne ROLAP • relational data representation • harde to build and query • smaller storage • no translation from/to relational model www.dehne.net Top Down Cube (Pipesort) ABCD ABC AC AB A ABD ACD AD BC B C All Frank Dehne BCD BD CD D • Construct the data cube lattice • Estimate the edge costs • Find a least cost spanning tree • Compute the views by following the “pipes” www.dehne.net Optimizations – Share-sorts - sharing sorting cost across multiple group-bys. – Smallest parent - computing a cuboid from the smallest previously computed parent. – Cache results - reduce I/O by caching (in memory) parent views from which other cuboids are computed. – Amortize disk-scans - compute as many child views as possible when scanning each parent. Frank Dehne www.dehne.net Bottom Up Cube ABCD ABC AC AB A ABD ACD AD BC B C BCD BD CD D All Frank Dehne www.dehne.net Bottom Up Cube C1 B1 C2 A1 B2 B3 B4 A2 A3 Frank Dehne D1 D2 • Partition large view into memory-sized units • Perform sorting operations in memory • May significantly reduce external memory processing www.dehne.net Our Results – Parallel top-down ROLAP cube construction for shared disks (Distributed and Parallel Databases, 2002) – Parallel top-down ROLAP cube construction for distributed disks (IPDPS 2002) – Parallel bottom-up ROLAP cube construction for shared and distributed disks (Distributed and Parallel Databases, 2002) – Parallel ROLAP cube indexing for distributed disks (CCGrid 2003) Frank Dehne www.dehne.net Our Results – Parallel top-down ROLAP cube construction for shared disks (Distributed and Parallel Databases, 2002) – Parallel top-down ROLAP cube construction for distributed disks (IPDPS 2002) – Parallel bottom-up ROLAP cube construction for shared and distributed disks (Distributed and Parallel Databases, 2002) – Parallel ROLAP cube indexing for distributed disks (CCGrid 2003) Frank Dehne www.dehne.net Parallel top-down ROLAP cube Our approach: – Partition the load in advance and assign cuboids to individual processors – Local computation exploits existing optimized sequential algorithms (ROLAP) – Communication is reduced to a single phase in which work lists are distributed Frank Dehne www.dehne.net Parallel top-down ROLAP cube CBAD CBA BAD ACD AC AD BA A B Frank Dehne CB DB C All BCD CD D • Cut the process tree into p “equal weight” sub-trees • Each processor independently generates cuboids from its own sub-tree • Load balance/stripe the output www.dehne.net Tree Partitioning • Optimal tree partitioning is NP-complete • Min-max tree k-partitioning: Given a tree T with n vertices and a positive weight assigned to each vertex, delete k edges in the tree to obtain k connected components T1, T2, ... Tk+1 such that the largest total weight of a resulting sub-tree is minimized. • O(n) time, Frederickson 1990 • O(Rk(k + log d)+n) time - Becker, Perl and Schach ‘82 Frank Dehne www.dehne.net Over Sampling p subtrees s * p subtrees p subsets Frank Dehne www.dehne.net Time vs. #Proc Frank Dehne www.dehne.net For more information ... http://cgm.dehne.net Frank Dehne www.dehne.net Frank Dehne www.dehne.net