Data Management and Data Processing Support on Array-Based Scientific Data
Yi Wang
Advisor: Gagan Agrawal
Candidacy Examination

Big Data Is Often Big Arrays
• Array data is everywhere:
  – Molecular simulation: molecular data
  – Life science: DNA sequencing data (microarray)
  – Earth science: ocean and climate data
  – Space science: astronomy data

Inherent Limitations of Current Tools and Paradigms
• Most scientific data management and data processing tools are too heavyweight
  – Hard to cope with different data formats and physical structures (variety)
  – Data transformation and data transfer are often prohibitively expensive (volume)
• Prominent examples
  – RDBMSs: not suited for array data
  – Array DBMSs: costly data ingestion
  – MapReduce: requires a specialized file system

Mismatch Between Scientific Data and DBMSs
• Scientific (array) datasets:
  – Very large, but processed infrequently
  – Read/append only
  – No resources for reloading data
  – Popular formats: NetCDF and HDF5
• Database technologies:
  – Designed for read-write data
  – ACID guaranteed
  – Assume data reloading/reformatting is feasible

Example Array Data Format: HDF5
• HDF5 (Hierarchical Data Format)

The Upfront Cost of Using SciDB
• High-level data flow
  – Requires data ingestion
• Data ingestion steps
  – Convert raw files (e.g., HDF5) -> CSV
  – Load the CSV files into SciDB
"EarthDB: scalable analysis of MODIS data using SciDB" - G. Planthaber et al.

Thesis Statement
• Native data can be queried and/or processed efficiently using popular abstractions
  – Process data stored in the native format, e.g., NetCDF and HDF5
  – Support SQL-like operators, e.g., selection and aggregation
  – Support array operations, e.g., structural aggregations
  – Support a MapReduce-like processing API

Outline
• Data Management Support
  – Supporting a Light-Weight Data Management Layer Over HDF5
  – SAGA: Array Storage as a DB with Support for Structural Aggregations
  – Approximate Aggregations Using Novel Bitmap Indices
• Data Processing Support
  – SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
• Future Work

Overall Idea
• An SQL implementation over HDF5
  – Ease of use: a declarative language instead of a low-level programming language plus the HDF5 API
  – Abstraction: provides a virtual relational view
• High efficiency
  – Load data on demand (lazy loading)
  – Parallel query processing
  – Server-side aggregation

Functionality
• Query based on dimension index values (Type 1): index-based condition
  – Also supported by the HDF5 API
• Query based on dimension scales (Type 2): coordinate-based condition
  – Uses the coordinate system instead of the physical layout (array subscripts)
• Query based on data values (Type 3): content-based condition
  – Simple datatypes and compound datatypes
• Aggregate queries
  – SUM, COUNT, AVG, MIN, and MAX
  – Server-side aggregation to minimize data transfer
(A code sketch of how Type 1 and Type 2 selections map onto HDF5 reads follows at the end of this part.)

Execution Overview
[Diagram: 1D AND-logic condition list; 2D OR-logic condition list; 1D OR-logic condition list sharing the same content-based condition.]

Experimental Setup
• Experimental datasets
  – 4 GB (sequential experiments) and 16 GB (parallel experiments)
  – 4D: time, cols, rows, and layers
• Compared with baseline performance and OPeNDAP
  – Baseline performance: no query parsing
  – OPeNDAP: translates HDF5 into a specialized data format

Sequential Comparison with OPeNDAP (Type 2 and Type 3 Queries)
[Figure: execution times (sec) of Baseline, OPeNDAP, and Sequential for query coverages from <20% to >80%.]

Parallel Query Processing for Type 2 and Type 3 Queries
[Figure: execution times (sec) on 1 to 16 nodes for query coverages from <20% to >80%.]
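The following is a minimal sketch, not the system's implementation, of how the Type 1 (index-based) and Type 2 (coordinate-based) selections above can be pushed down to HDF5 hyperslab reads. It assumes the h5py library; the file name, dataset names, and dimension-scale layout are hypothetical.

```python
# Sketch: mapping Type 1 and Type 2 conditions onto HDF5 hyperslab reads.
# File, dataset, and scale names below are hypothetical.
import h5py
import numpy as np

with h5py.File("ocean.h5", "r") as f:
    temp = f["temperature"]  # 4D dataset: time x cols x rows x layers

    # Type 1: a dimension-index condition such as "time in [0, 10)"
    # maps directly to array subscripts, i.e., a hyperslab read.
    slab1 = temp[0:10, :, :, :]

    # Type 2: a dimension-scale condition such as "latitude in [30, 40]"
    # is first translated from coordinates to index bounds, then issued
    # as the same kind of hyperslab read (assumes a non-empty match).
    lat = f["rows_scale"][:]  # hypothetical 1D dimension scale for 'rows'
    idx = np.where((lat >= 30.0) & (lat <= 40.0))[0]
    slab2 = temp[:, :, idx.min():idx.max() + 1, :]
```

Reading only the selected hyperslab is what makes lazy loading pay off: the unselected portion of the array never leaves the file.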
Outline
• Data Management Support
  – Supporting a Light-Weight Data Management Layer Over HDF5
  – SAGA: Array Storage as a DB with Support for Structural Aggregations
  – Approximate Aggregations Using Novel Bitmap Indices
• Data Processing Support
  – SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
• Future Work

Array Storage as a DB
• A paradigm similar to NoDB
  – Still maintains DB functionality
  – But no data ingestion
• DB and array storage as a DB: friends or foes?
  – When to use a DB? Load once, query frequently
  – When to use array storage directly? Query infrequently, so avoid loading
• Our system focuses on a set of special array operations: structural aggregations

Structural Aggregation Types
• Non-overlapping aggregation
• Overlapping aggregation

Grid Aggregation
• Parallelization: easy after partitioning
• Considerations
  – Data contiguity, which affects I/O performance
  – Communication cost
  – Load balancing for skewed data
• Partitioning strategies
  – Coarse-grained
  – Fine-grained
  – Hybrid
  – Auto-grained

Partitioning Strategy Decider
• Cost model: analyze the loading cost and the computation cost separately
  – Loading cost: loading factor × data amount
  – Computation cost
• Exception - auto-grained: takes the loading cost and the computation cost as a whole

Overlapping Aggregation
• I/O cost
  – Reuse the data already in memory
  – Reduce disk I/O to enhance I/O performance
• Memory accesses
  – Reuse the data already in the cache
  – Reduce cache misses to accelerate the computation
• Aggregation approaches
  – Naïve approach
  – Data-reuse approach
  – All-reuse approach

Example: Hierarchical Aggregation
• Aggregate 3 grids in a 6 × 6 array
  – The innermost 2 × 2 grid
  – The middle 4 × 4 grid
  – The outermost 6 × 6 grid
• (Parallel) sliding aggregation is much more complicated

Naïve Approach
1. Load the innermost grid
2. Aggregate the innermost grid
3. Load the middle grid
4. Aggregate the middle grid
5. Load the outermost grid
6. Aggregate the outermost grid
For N grids: N loads + N aggregations

Data-Reuse Approach
1. Load the outermost grid
2. Aggregate the outermost grid
3. Aggregate the middle grid
4. Aggregate the innermost grid
For N grids: 1 load + N aggregations

All-Reuse Approach
1. Load the outermost grid
2. Once an element is accessed, accumulatively update every aggregation result it contributes to
For N grids: 1 load + 1 aggregation
(Elements in the outer ring update only the outermost aggregation result; elements in the middle ring update both the outermost and the middle aggregation results; elements in the innermost grid update all 3 aggregation results. A code sketch follows at the end of this part.)

Sequential Performance Comparison
• Array slab / dataset size (8 GB) ratio: from 12.5% to 100%
• Coarse-grained partitioning for the grid aggregation
• All-reuse approach for the sliding aggregation
• SciDB stores 'chunked' arrays: it can even support overlapping chunking to accelerate the sliding aggregation
[Figure: grid and sliding aggregation times (secs) of SciDB, SAGA_FLAT, and SAGA_CHUNKED for slab ratios from 12.5% to 100%; the sliding case also compares SciDB non-overlapping vs. overlapping chunking.]

Parallel Sliding Aggregation Performance
• # of nodes: from 1 to 16
• 8 GB data
• Sliding grid size: from 3 × 3 to 6 × 6
[Figure: average aggregation times (secs) of the naïve (coarse-grained), data-reuse, and all-reuse approaches on 1 to 16 nodes.]
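To make the contrast among the three approaches concrete, here is a minimal sketch of the all-reuse idea on the 6 × 6 hierarchical example above: one load and one pass, with each element accumulatively updating every grid aggregate it contributes to. The grid layout follows the slides; the variable names are ours, not SAGA's.

```python
# Sketch: all-reuse hierarchical aggregation (1 load + 1 aggregation pass).
import numpy as np

a = np.arange(36, dtype=float).reshape(6, 6)  # the outermost 6 x 6 grid

# Nested, centered grids: innermost 2 x 2, middle 4 x 4, outermost 6 x 6.
half_widths = [1, 2, 3]
sums = [0.0] * len(half_widths)
counts = [0] * len(half_widths)
center = 3

# Single pass: each element updates every enclosing grid's aggregate,
# so the innermost elements update all 3 results, the middle ring
# updates 2, and the outer ring updates only the outermost result.
for i in range(6):
    for j in range(6):
        for k, h in enumerate(half_widths):
            if center - h <= i < center + h and center - h <= j < center + h:
                sums[k] += a[i, j]
                counts[k] += 1

averages = [s / c for s, c in zip(sums, counts)]  # innermost, middle, outermost
```

The pass reads each element exactly once, which is why the approach needs only one load and one aggregation pass regardless of the number of nested grids.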
Outline
• Data Management Support
  – Supporting a Light-Weight Data Management Layer Over HDF5
  – SAGA: Array Storage as a DB with Support for Structural Aggregations
  – Approximate Aggregations Using Novel Bitmap Indices
• Data Processing Support
  – SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
• Future Work

Approximate Aggregations Over Array Data
• Challenges
  – Flexible aggregation over any subset: dimension-based, value-based, or combined predicates
  – Aggregation accuracy: spatial distribution and value distribution
  – Aggregation without data reorganization: reorganization is prohibitively expensive
• Existing techniques - all problematic for array data
  – Sampling: unable to capture both distributions
  – Histograms: no spatial distribution
  – Wavelets: no value distribution
• New data synopses: bitmap indices

Bitmap Indexing and Pre-Aggregation
• Bitmap indices
• Pre-aggregation statistics

Approximate Aggregation Workflow

Running Example
SELECT SUM(Array) WHERE Value > 3 AND ID < 4;
• Bitmap indices
  – Predicate bitvector: 11110000
  – i1': 01000000
  – i2': 10010000
• Pre-aggregation statistics
  – Count1: 1, Count2: 2
  – Estimated sum: 7 × 1/2 + 16 × 2/3 = 14.167
  – Precise sum: 14
(A code sketch of this estimation appears at the end of this part.)

A Novel Binning Strategy
• Conventional binning strategies
  – Equi-width and equi-depth
  – Not designed for aggregation
• V-optimized binning strategy
  – Inspired by the V-optimal histogram
  – Goal: approximately minimize the sum of squared errors (SSE)
  – Unbiased V-optimized binning: assumes data is queried randomly
  – Weighted V-optimized binning: assumes the frequently queried subarea is prior knowledge

Unbiased V-Optimized Binning
• 3 steps:
  1) Initial binning: use equi-depth binning
  2) Iterative refinement: adjust bin boundaries
  3) Bitvector generation: mark spatial positions

Weighted V-Optimized Binning
• Difference: minimizes WSSE instead of SSE
• Similar binning algorithm
• Major modification: the representative value for each bin is not the mean value

Experimental Setup
• Data skew
  1) Dense range: less than 5% of the space but over 90% of the data
  2) Sparse range: more than 95% of the space but less than 10% of the data
• 5 types of queries
  1) DB: with dimension-based predicates
  2) VBD: with value-based predicates over the dense range
  3) VBS: with value-based predicates over the sparse range
  4) CD: with combined predicates over the dense range
  5) CS: with combined predicates over the sparse range
• Ratio of querying possibilities: 10 : 1
  – 50% of the synthetic data is frequently queried
  – 25% of the real-world data is frequently queried

SUM Aggregation Accuracy of Different Binning Strategies on the Synthetic Dataset
[Figure: accuracy of equi-width, equi-depth, unbiased V-optimized, and weighted V-optimized binning.]

SUM Aggregation Accuracy of Different Methods on the Real-World Dataset
[Figure: accuracy of Sampling_2%, Sampling_20%, MD-histogram (equi-depth), equi-depth, unbiased V-optimized, and weighted V-optimized.]
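Here is a minimal sketch of the estimation in the running example. The full per-bin bitvectors are our reconstruction, chosen so that ANDing them with the predicate bitvector reproduces the slide's i1' (01000000) and i2' (10010000); the pre-aggregated sums (7 and 16) are from the slide.

```python
# Sketch: bitmap-based approximate SUM for
#   SELECT SUM(Array) WHERE Value > 3 AND ID < 4;
import numpy as np

predicate = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)  # ID < 4

# Bins overlapping the value predicate (Value > 3): each entry holds the
# bin's bitvector (our reconstruction) and its pre-aggregated sum.
bins = [
    (np.array([0, 1, 0, 0, 0, 0, 1, 0], dtype=bool), 7.0),   # bin 1, 2 elements
    (np.array([1, 0, 0, 1, 0, 1, 0, 0], dtype=bool), 16.0),  # bin 2, 3 elements
]

estimate = 0.0
for bv, pre_sum in bins:
    total = int(bv.sum())                  # elements in the bin
    matched = int((bv & predicate).sum())  # elements also passing ID < 4
    if total:
        # assume matched elements carry the bin's average value
        estimate += pre_sum * matched / total

print(estimate)  # 7 * 1/2 + 16 * 2/3 = 14.1666..., vs. the precise sum 14
```

The estimate scales each bin's pre-aggregated sum by the fraction of the bin's elements that satisfy the predicate, which is exactly the 7 × 1/2 + 16 × 2/3 computation on the slide.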
Outline
• Data Management Support
  – Supporting a Light-Weight Data Management Layer Over HDF5
  – SAGA: Array Storage as a DB with Support for Structural Aggregations
  – Approximate Aggregations Using Novel Bitmap Indices
• Data Processing Support
  – SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
• Future Work

Scientific Data Analysis Today
• "Store-first-analyze-after"
  – Reload data into another file system, e.g., from PVFS to HDFS
  – Reload data into another data format, e.g., NetCDF/HDF5 data into a specialized format
• Problems
  – Long data migration/transformation time
  – Stresses the network and disks

System Overview
• Key feature: the scientific data processing module

Scientific Data Processing Module

Parallel Data Processing Times on 16 GB Datasets
• KNN and K-Means, on NetCDF, HDF5, and FLAT data
[Figure: average data processing times (sec) for KNN with 1 to 8 threads and for K-Means on 1 to 16 nodes.]

Future Work Outline
• Data Management Support
  – SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices
  – SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices
• Data Processing Support
  – StreamingMATE: A Novel MapReduce-Like Framework Over Scientific Data Streams

SciSD
• Subgroup discovery
  – Goal: identify all the subsets that are significantly different from the entire dataset/general population, w.r.t. a target variable
  – Can be widely used in scientific knowledge discovery
• Novelty
  – Subsets can involve dimensional and/or value ranges
  – All numeric attributes
  – High efficiency through frequent bitmap-based approximate aggregations

Running Example

SciCSM
• "Sometimes it's good to contrast what you like with something else. It makes you appreciate it even more." - Darby Conley, Get Fuzzy, 2001
• Contrast set mining
  – Goal: identify all the filters that can generate significantly different subsets
  – Common filters: time periods, spatial areas, etc.
  – Usage: classifier design, change detection, disaster prediction, etc.

Running Example

StreamingMATE
• Extends the precursor system SciMATE to process scientific data streams
• Generalized reduction
  – Reduces the data stream to a reduction object
  – No shuffling or sorting
• Focus on load balancing issues
  – Input data volume can be highly variable
  – Topology updates: add/remove/update streaming operators

StreamingMATE Overview

Backup Slides

Hyperslab Selector
[Diagram: on false, nullify the condition list; on true, nullify the elementary condition; fill up all the index boundary values.]
• Example: 4-dim salinity dataset
  – dim1: time [0, 1023]
  – dim2: cols [0, 166]
  – dim3: rows [0, 62]
  – dim4: layers [0, 33]

Type 2 and Type 3 Query Examples

Aggregation Query Examples
• AG1: simple global aggregation
• AG2: GROUP BY clause + HAVING clause
• AG3: GROUP BY clause

Sequential and Parallel Performance of Aggregation Queries
[Figure: execution times (sec) of OPeNDAP and of 2 to 16 nodes for aggregation query types AG1, AG2, and AG3.]

Array Databases
• Examples: SciDB, RasDaMan, and MonetDB
• Treat arrays as first-class citizens
  – Everything is defined in the array dialect
• Lightweight or no ACID maintenance
  – No write conflicts: ACID is inherently guaranteed
• Other desired functionality
  – Structural aggregations, array joins, provenance, ...

Structural Aggregations
• Aggregate the elements based on positional relationships, one square at a time
  – E.g., a moving average that computes the average of each 2 × 2 square from left to right:
    Input array: [[1, 2, 3, 4], [5, 6, 7, 8]] -> Aggregation result: [3.5, 4.5, 5.5]
(A code sketch of this example appears at the end of this part.)

Coarse-Grained Partitioning
• Pros
  – Low I/O cost
  – Low communication cost
• Cons
  – Workload imbalance for skewed data

Fine-Grained Partitioning
• Pros
  – Excellent workload balance for skewed data
• Cons
  – Relatively high I/O cost
  – High communication cost

Hybrid Partitioning
• Pros
  – Low communication cost
  – Good workload balance for skewed data
• Cons
  – High I/O cost

Auto-Grained Partitioning
• 2 steps
  – Estimate the grid density (after filtering) by sampling, and thus estimate the computation cost (based on the time complexity)
    • For each grid, total processing cost = constant loading cost + varying computation cost
  – Partition the cost array - balanced contiguous multi-way partitioning
    • Dynamic programming (small # of grids)
    • Greedy (large # of grids)
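As a concrete illustration of the partitioning step just described, here is a minimal sketch of the greedy variant of balanced contiguous multi-way partitioning: walk the per-grid cost array and close a partition once it reaches its ideal share of the remaining cost. The cost values and partition count are invented for the example; this is our sketch, not the system's code.

```python
# Sketch: greedy balanced contiguous multi-way partitioning of a cost array.
def contiguous_partition(costs, k):
    """Split `costs` into k contiguous chunks with roughly equal sums."""
    parts, start = [], 0
    remaining_sum = float(sum(costs))
    for p in range(k, 1, -1):  # p = partitions still to fill, incl. this one
        target = remaining_sum / p
        acc, i = 0.0, start
        # Take grids until this chunk reaches its ideal share, leaving
        # at least one grid for each remaining partition.
        while acc < target and i < len(costs) - (p - 1):
            acc += costs[i]
            i += 1
        parts.append(costs[start:i])
        remaining_sum -= acc
        start = i
    parts.append(costs[start:])
    return parts

# Hypothetical per-grid totals (loading cost + estimated computation cost):
print(contiguous_partition([4, 1, 1, 6, 2, 2, 8], 3))
# -> [[4, 1, 1, 6], [2, 2], [8]]
```

And, as promised in the Structural Aggregations slide earlier in this part, a minimal sketch of the 2 × 2 moving-average example (input [[1, 2, 3, 4], [5, 6, 7, 8]], expected result [3.5, 4.5, 5.5]):

```python
# Sketch: sliding (moving-average) structural aggregation over 2 x 2 squares.
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]], dtype=float)

rows, cols = arr.shape
result = [arr[i:i + 2, j:j + 2].mean()
          for i in range(rows - 1)
          for j in range(cols - 1)]
print(result)  # [3.5, 4.5, 5.5]
```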
Auto-Grained Partitioning (Cont'd)
• Pros
  – Low I/O cost
  – Low communication cost
  – Great workload balance for skewed data
• Cons
  – Overhead of sampling and runtime partitioning

Partitioning Strategy Summary

Strategy       | I/O Performance | Workload Balance | Scalability | Additional Cost
Coarse-Grained | Excellent       | Poor             | Excellent   | None
Fine-Grained   | Poor            | Excellent        | Poor        | None
Hybrid         | Poor            | Good             | Good        | None
Auto-Grained   | Great           | Great            | Great       | Nontrivial

Our partitioning strategy decider can help choose the best strategy.

All-Reuse Approach (Cont'd)
• Key insight
  – # of aggregates ≤ # of queried elements
  – More computationally efficient to iterate over the elements and update the associated aggregates
• More benefits
  – Load balance (for hierarchical/circular aggregations)
  – More speedup for compound array elements
    • The data type of an aggregate is usually primitive, but this is not always true for an array element

Parallel Grid Aggregation Performance
• 4 processors on a real-life dataset of 8 GB
• User-defined aggregation: K-Means
  – Vary the number of iterations to vary the computation amount
[Figure: aggregation times (secs) of coarse-grained, fine-grained, hybrid, and auto-grained partitioning for 2 to 12 iterations.]

Data Access Strategies and Patterns
• Full read: probably too expensive for reading a small data subset
• Partial read
  – Strided pattern
  – Column pattern
  – Discrete point pattern

Indexing Cost of Different Binning Strategies with Varying # of Bins on the Synthetic Dataset
[Figure: indexing times (secs) of equi-width, equi-depth, and V-optimized binning for 50 to 800 bins.]

SUM Aggregation of Equi-Width Binning with Varying # of Bins on the Synthetic Dataset
[Figure: average relative error for 50 to 800 bins over coverages from 0.005% to 2%.]

SUM Aggregation of Equi-Depth Binning with Varying # of Bins on the Synthetic Dataset
[Figure: average relative error for 50 to 800 bins over coverages from 0.005% to 2%.]

SUM Aggregation of V-Optimized Binning with Varying # of Bins on the Synthetic Dataset
[Figure: average relative error for 50 to 800 bins over coverages from 0.005% to 2%.]
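As background for the binning comparisons above, here is a minimal sketch contrasting the two conventional strategies: equi-width splits the value range into equal intervals, while equi-depth gives each bin roughly the same number of elements. The synthetic data is hypothetical.

```python
# Sketch: equi-width vs. equi-depth bin boundaries for bitmap indexing.
import numpy as np

def equi_width_edges(values, nbins):
    lo, hi = values.min(), values.max()
    return np.linspace(lo, hi, nbins + 1)

def equi_depth_edges(values, nbins):
    quantiles = np.linspace(0, 100, nbins + 1)
    return np.percentile(values, quantiles)

values = np.random.lognormal(size=10_000)  # skewed synthetic values

w_counts = np.histogram(values, bins=equi_width_edges(values, 8))[0]
d_counts = np.histogram(values, bins=equi_depth_edges(values, 8))[0]

# On skewed data most elements land in the first few equi-width bins,
# while equi-depth bins stay balanced; V-optimized binning then refines
# the boundaries further to (approximately) minimize the SSE.
print(w_counts)
print(d_counts)
```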
Average Relative Error (%) of MAX Aggregation of Different Methods on the Real-World Dataset

SUM Aggregation Times of Different Methods on the Real-World Dataset (DB)
[Figure: aggregation times (secs) of Sampling_2%, Sampling_20%, equi-depth, V-optimized, and precise aggregation over coverages from 10% to 90%.]

SUM Aggregation Times of Different Methods on the Real-World Dataset (VBD)
[Figure: aggregation times (secs) of the same five methods over coverages from 10% to 90%.]

SUM Aggregation Times of Different Methods on the Real-World Dataset (VBS)
[Figure: aggregation times (secs) of the same five methods over coverages from 0.01% to 5%.]

SUM Aggregation Times of Different Methods on the Real-World Dataset (CD)
[Figure: aggregation times (secs) of Sampling_2%, Sampling_20%, equi-depth, unbiased V-optimized, weighted V-optimized, and precise aggregation over coverages from 10% to 90%.]

SUM Aggregation Times of Different Methods on the Real-World Dataset (CS)
[Figure: aggregation times (secs) of Sampling_2%, Sampling_20%, equi-depth, V-optimized, and precise aggregation over coverages from 0.005% to 2%.]

SD vs. Classification