Data Management and Data Processing
Support on Array-Based Scientific Data
Yi Wang
Advisor: Gagan Agrawal
Candidacy Examination
Big Data Is Often Big Arrays
• Array data is everywhere
– Molecular Simulation: molecular data
– Life Science: DNA sequencing data (microarray)
– Earth Science: ocean and climate data
– Space Science: astronomy data
Inherent Limitations of Current Tools and Paradigms
• Most scientific data management and data
processing tools are too heavy-weight
– Hard to cope with different data formats and physical
structures (variety)
– Data transformation and data transfer are often
prohibitively expensive (volume)
• Prominent Examples
– RDBMSs: not suited for array data
– Array DBMSs: require costly data ingestion
– MapReduce: requires a specialized file system
Mismatch Between Scientific Data and DBMS
• Scientific (Array) Datasets:
– Very large but processed infrequently
– Read/append only
– No resources for reloading data
– Popular formats: NetCDF and HDF5
• Database Technologies
– For (read-write) data – ACID guaranteed
– Assume data reloading/reformatting feasible
Example Array Data Format - HDF5
• HDF5 (Hierarchical Data Format)
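A small sketch of what "hierarchical" means in HDF5, using h5py: groups act like directories, datasets hold the arrays, and attributes attach metadata to either. The file, group, and dataset names are illustrative, not the dissertation's datasets.

```python
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    ocean = f.create_group("ocean")                          # group = "directory"
    sal = ocean.create_dataset("salinity", shape=(10, 6, 4, 3), dtype="f8")
    sal.attrs["units"] = "PSU"                               # attribute metadata
    ocean["time"] = np.arange(10)                            # 1-D dataset under /ocean

with h5py.File("example.h5", "r") as f:
    print(list(f["ocean"]))                                  # ['salinity', 'time']
    print(f["ocean/salinity"].shape, f["ocean/salinity"].attrs["units"])
```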
The Upfront Cost of Using SciDB
• High-Level Data Flow
– Requires data ingestion
• Data Ingestion Steps
– Raw files (e.g., HDF5) -> CSV
– Load CSV files into SciDB
“EarthDB: scalable analysis of MODIS data using SciDB” – G. Planthaber et al.
Thesis Statement
• Native Data Can Be Queried and/or Processed
Efficiently Using Popular Abstractions
– Process data stored in the native format, e.g.,
NetCDF and HDF5
– Support SQL-like operators, e.g., selection and
aggregation
– Support array operations, e.g., structural
aggregations
– Support MapReduce-like processing API
Outline
• Data Management Support
– Supporting a Light-Weight Data Management Layer
Over HDF5
– SAGA: Array Storage as a DB with Support for
Structural Aggregations
– Approximate Aggregations Using Novel Bitmap
Indices
• Data Processing Support
– SciMATE: A Novel MapReduce-Like Framework for
Multiple Scientific Data Formats
• Future Work
Overall Idea
• An SQL Implementation Over HDF5
– Ease-of-use: declarative language instead of low-level programming language + HDF5 API
– Abstraction: provides a virtual relational view
• High Efficiency
– Load data on demand (lazy loading)
– Parallel query processing
– Server-side aggregation
Functionality
• Query Based on Dimension Index Values (Type 1): index-based condition
– Also supported by the HDF5 API
• Query Based on Dimension Scales (Type 2): coordinate-based condition
– Coordinate system instead of the physical layout (array subscript)
• Query Based on Data Values (Type 3): content-based condition
– Simple datatype + compound datatype
• Aggregate Query
– SUM, COUNT, AVG, MIN, and MAX
– Server-side aggregation to minimize the data transfer
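A minimal sketch, not the system's implementation, of the three query types evaluated directly over an HDF5 file with h5py and NumPy. The file name, dataset path, and dimension-scale dataset are assumptions.

```python
import h5py
import numpy as np

with h5py.File("salinity.h5", "r") as f:                  # hypothetical file
    data = f["/salinity"]                                  # 4-D: time, cols, rows, layers

    # Type 1: index-based condition, mapped directly to an HDF5 hyperslab.
    t1 = data[0:10, :, :, 0]

    # Type 2: coordinate-based condition; translate dimension-scale values
    # into index ranges first, then read the corresponding hyperslab.
    time_scale = f["/time"][:]                             # hypothetical scale dataset
    idx = np.where((time_scale >= 100) & (time_scale < 200))[0]
    t2 = data[idx.min():idx.max() + 1, :, :, 0]

    # Type 3: content-based condition, applied to the loaded values.
    t3 = t1[t1 > 35.0]

print(t1.shape, t2.shape, t3.shape)
```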
Execution Overview
(Figure: execution overview – a 1D AND-logic condition list, a 2D OR-logic condition list, and a 1D OR-logic condition list whose entries share the same content-based condition.)
Experimental Setup
• Experimental Datasets
– 4 GB (sequential experiments) and 16 GB
(parallel experiments)
– 4D: time, cols, rows, and layers
• Compared with Baseline Performance and
OPeNDAP
– Baseline performance: no query parsing
– OPeNDAP: translates HDF5 into a specialized data
format
Sequential Comparison with OPeNDAP (Type 2 and Type 3 Queries)
(Chart: execution times in seconds for Baseline, OPeNDAP, and the sequential implementation at query coverages of <20%, 20%-40%, 40%-60%, 60%-80%, and >80%.)
Parallel Query Processing for Type 2 and Type 3 Queries
(Chart: execution times in seconds on 1, 2, 4, 8, and 16 nodes at query coverages of <20%, 20%-40%, 40%-60%, 60%-80%, and >80%.)
Outline
• Data Management Support
– Supporting a Light-Weight Data Management Layer
Over HDF5
– SAGA: Array Storage as a DB with Support for
Structural Aggregations
– Approximate Aggregations Using Novel Bitmap
Indices
• Data Processing Support
– SciMATE: A Novel MapReduce-Like Framework for
Multiple Scientific Data Formats
• Future Work
Array Storage as a DB
• A Paradigm Similar to NoDB
– Still maintains DB functionality
– But no data ingestion
• DB and Array Storage as a DB: Friends or Foes?
– When to use DB?
• Load once, and query frequently
– When to directly use array storage?
• Query infrequently, so avoid loading
• Our System
– Focuses on a set of special array operations: structural aggregations
Structural Aggregation Types
• Non-Overlapping Aggregation
• Overlapping Aggregation
Grid Aggregation
• Parallelization: Easy after Partitioning
• Considerations
– Data contiguity, which affects the I/O performance
– Communication cost
– Load balancing for skewed data
• Partitioning Strategies
– Coarse-grained
– Fine-grained
– Hybrid
– Auto-grained
Partitioning Strategy Decider
• Cost Model: analyze loading cost and computation cost separately
– Loading cost
• Loading factor × data amount
– Computation cost
• Exception – auto-grained: takes loading cost and computation cost as a whole
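A hypothetical sketch of the decider's cost model. The loading factors and per-node computation estimates are made-up inputs, not the system's actual constants.

```python
def estimate_cost(data_bytes, loading_factor, per_node_compute):
    """Loading cost = loading factor x data amount; the computation cost of a
    statically partitioned strategy is bounded by its slowest node."""
    return loading_factor * data_bytes + max(per_node_compute)

def choose_strategy(data_bytes, loading_factors, compute_estimates,
                    auto_grained_total=None):
    costs = {s: estimate_cost(data_bytes, loading_factors[s], compute_estimates[s])
             for s in compute_estimates}
    if auto_grained_total is not None:
        # Exception: auto-grained takes loading and computation cost as a
        # whole, so its estimate arrives as one combined number.
        costs["auto-grained"] = auto_grained_total
    return min(costs, key=costs.get)

# Example with made-up numbers (8 GB of data on 4 nodes).
GB = 1 << 30
best = choose_strategy(
    data_bytes=8 * GB,
    loading_factors={"coarse-grained": 1.0e-9,   # large contiguous reads
                     "fine-grained": 2.5e-9,     # many small reads
                     "hybrid": 2.0e-9},
    compute_estimates={"coarse-grained": [40, 90, 20, 10],   # skewed data
                       "fine-grained": [40, 41, 39, 40],
                       "hybrid": [45, 50, 42, 43]},
)
print("chosen strategy:", best)
```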
Overlapping Aggregation
• I/O Cost
– Reuse the data already in the memory
– Reduce the disk I/O to enhance the I/O performance
• Memory Accesses
– Reuse the data already in the cache
– Reduce cache misses to accelerate the computation
• Aggregation Approaches
– Naïve approach
– Data-reuse approach
– All-reuse approach
Example: Hierarchical Aggregation
• Aggregate 3 grids in a 6 × 6 array
– The innermost 2 × 2 grid
– The middle 4 × 4 grid
– The outermost 6 × 6 grid
• (Parallel) sliding aggregation is much more
complicated
Naïve Approach
1. Load the innermost grid
2. Aggregate the innermost grid
3. Load the middle grid
4. Aggregate the middle grid
5. Load the outermost grid
6. Aggregate the outermost grid
For N grids: N loads + N aggregations
Data-Reuse Approach
1. Load the outermost grid
2. Aggregate the outermost grid
3. Aggregate the middle grid
4. Aggregate the innermost grid
For N grids: 1 load + N aggregations
All-Reuse Approach
1. Load the outermost grid
2. Once an element is accessed, accumulatively update the aggregation results it contributes to
For N grids: 1 load + 1 aggregation
(Figure: elements of the outer ring update only the outermost aggregation result; elements of the middle ring update both the outermost and the middle aggregation results; elements of the innermost grid update all 3 aggregation results.)
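A minimal sketch of the all-reuse approach for the hierarchical-aggregation example: a 6 × 6 array with nested 2 × 2, 4 × 4, and 6 × 6 grids. The array contents and the SUM aggregate are illustrative; the point is the single pass that updates every aggregate an element contributes to.

```python
import numpy as np

array = np.arange(36, dtype=float).reshape(6, 6)   # made-up data
# Nested grids, innermost first: (row_start, row_end, col_start, col_end).
grids = [(2, 4, 2, 4), (1, 5, 1, 5), (0, 6, 0, 6)]
sums = [0.0] * len(grids)

# 1 load + 1 aggregation pass: each element is read once and accumulated into
# every enclosing grid's aggregate.
for (i, j), value in np.ndenumerate(array):
    for g, (r0, r1, c0, c1) in enumerate(grids):
        if r0 <= i < r1 and c0 <= j < c1:
            sums[g] += value

print(sums)  # innermost, middle, outermost sums
```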
Sequential Performance Comparison
• Array slab / dataset size (8 GB) ratio: from 12.5% to 100%
• Coarse-grained partitioning for the grid aggregation
• All-reuse approach for the sliding aggregation
• SciDB stores 'chunked' arrays: it can even support overlapping chunking to accelerate the sliding aggregation
(Charts: grid aggregation times in seconds for SciDB, SAGA_FLAT, and SAGA_CHUNKED, and sliding aggregation times in seconds for SciDB_Non-Overlapping, SciDB_Overlapping, SAGA_FLAT, and SAGA_CHUNKED, at array slab / dataset size ratios of 12.5%, 25%, 50%, and 100%.)
Parallel Sliding Aggregation Performance
• # of nodes: from 1 to 16
• 8 GB data
• Sliding grid size: from 3 × 3 to 6 × 6
(Chart: average aggregation times in seconds for the naïve (coarse-grained), data-reuse, and all-reuse approaches on 1, 2, 4, 8, and 16 nodes.)
Outline
• Data Management Support
– Supporting a Light-Weight Data Management Layer
Over HDF5
– SAGA: Array Storage as a DB with Support for
Structural Aggregations
– Approximate Aggregations Using Novel Bitmap
Indices
• Data Processing Support
– SciMATE: A Novel MapReduce-Like Framework for
Multiple Scientific Data Formats
• Future Work
Approximate Aggregations Over Array Data
• Challenges
– Flexible Aggregation Over Any Subset
• Dimension-based/value-based/combined predicates
– Aggregation Accuracy
• Spatial distribution/value distribution
– Aggregation Without Data Reorganization
• Reorganization is prohibitively expensive
• Existing Techniques - All Problematic for Array Data
– Sampling: unable to capture both distributions
– Histograms: no spatial distribution
– Wavelets: no value distribution
• New Data Synopses – Bitmap Indices
Bitmap Indexing and Pre-Aggregation
• Bitmap Indices
• Pre-Aggregation Statistics
Approximate Aggregation Workflow
Running Example
• Query: SELECT SUM(Array) WHERE Value > 3 AND ID < 4;
• Bitmap Indices
– Predicate bitvector: 11110000
– i1’: 01000000
– i2’: 10010000
• Pre-Aggregation Statistics
– Count1: 1, Count2: 2
– Estimated Sum: 7 × 1/2 + 16 × 2/3 = 14.167 (each bin's pre-aggregated sum is scaled by the fraction of its indexed elements that satisfy the predicate)
– Precise Sum: 14
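A minimal sketch of the bitmap-plus-pre-aggregation estimation used in the running example: each bin keeps a bitvector and a pre-aggregated (sum, count), and a query's SUM is estimated by scaling each qualifying bin's sum by the fraction of its elements that fall under the predicate bitvector. The array values, bin boundaries, and query are assumptions, so the numbers differ from the slide's.

```python
import numpy as np

values = np.array([2.0, 4.0, 1.0, 6.0, 3.0, 8.0, 5.0, 7.0])   # assumed array
bin_edges = [0.0, 3.0, 6.0, 9.0]                               # assumed bins

# Indexing time: one bitvector and one (sum, count) pair per bin.
bitvectors, pre_sums, pre_counts = [], [], []
for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
    bv = (values >= lo) & (values < hi)
    bitvectors.append(bv)
    pre_sums.append(values[bv].sum())
    pre_counts.append(int(bv.sum()))

# Query time: SUM over "Value >= 3 AND ID < 4". The value-based condition
# selects whole bins (here it aligns with a bin boundary for simplicity);
# the dimension-based condition becomes a predicate bitvector.
predicate = np.arange(len(values)) < 4
qualifying = [i for i, lo in enumerate(bin_edges[:-1]) if lo >= 3.0]

estimate = 0.0
for i in qualifying:
    hits = int((bitvectors[i] & predicate).sum())
    if pre_counts[i]:
        estimate += pre_sums[i] * hits / pre_counts[i]

print("approximate SUM:", estimate)                                 # 11.0
print("precise SUM:", values[(values >= 3) & predicate].sum())      # 10.0
```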
A Novel Binning Strategy
• Conventional Binning Strategies
– Equi-width/Equi-depth
– Not designed for aggregation
• V-Optimized Binning Strategy
– Inspired by the V-Optimal Histogram
– Goal: approximately minimize the Sum Squared Error (SSE)
– Unbiased V-Optimized Binning: assumes data is queried uniformly at random
– Weighted V-Optimized Binning: assumes the frequently queried subarea is available as prior knowledge
Unbiased V-Optimized Binning
• 3 Steps:
1) Initial Binning: start from equi-depth binning
2) Iterative Refinement: adjust bin boundaries to reduce the SSE
3) Bitvector Generation: mark spatial positions
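A hedged sketch of steps 1 and 2: start from equi-depth boundaries, then greedily move one boundary at a time while that lowers the total SSE (sum of squared deviations from each bin's mean). This is an illustrative reconstruction, not necessarily the dissertation's exact refinement procedure.

```python
import numpy as np

def sse(sorted_vals, boundaries):
    """Total SSE of the binning defined by boundary indices into sorted_vals."""
    total = 0.0
    for lo, hi in zip(boundaries[:-1], boundaries[1:]):
        chunk = sorted_vals[lo:hi]
        if len(chunk):
            total += ((chunk - chunk.mean()) ** 2).sum()
    return total

def v_optimized_bins(values, num_bins, max_iters=50):
    vals = np.sort(values)
    # Step 1: initial binning = equi-depth boundaries (indices into vals).
    bounds = [int(round(i * len(vals) / num_bins)) for i in range(num_bins + 1)]
    # Step 2: iterative refinement - shift interior boundaries while SSE drops.
    for _ in range(max_iters):
        improved = False
        for b in range(1, num_bins):
            for step in (-1, 1):
                cand = list(bounds)
                cand[b] += step
                if (bounds[b - 1] < cand[b] < bounds[b + 1]
                        and sse(vals, cand) < sse(vals, bounds)):
                    bounds, improved = cand, True
        if not improved:
            break
    # Return the value range covered by each refined bin.
    return [(vals[lo], vals[hi - 1]) for lo, hi in zip(bounds[:-1], bounds[1:])]

print(v_optimized_bins(np.random.default_rng(0).normal(size=1000), num_bins=5))
```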
Weighted V-Optimized Binning
• Difference: minimize the WSSE instead of the SSE
• Similar binning algorithm
• Major Modification
– The representative value for each bin is no longer the mean value (for the WSSE, the weighted mean minimizes the error)
Experimental Setup
• Data Skew
1) Dense Range: less than 5% of the space but over 90% of the data
2) Sparse Range: over 95% of the space but less than 10% of the data
• 5 Types of Queries
1) DB: with dimension-based predicates
2) VBD: with value-based predicates over the dense range
3) VBS: with value-based predicates over the sparse range
4) CD: with combined predicates over the dense range
5) CS: with combined predicates over the sparse range
• Ratio of Querying Probabilities – 10 : 1
– 50% of the synthetic data is frequently queried
– 25% of the real-world data is frequently queried
SUM Aggregation Accuracy of Different Binning Strategies on the Synthetic Dataset
(Chart: compares equi-width, equi-depth, unbiased V-optimized, and weighted V-optimized binning.)
SUM Aggregation Accuracy of Different Methods on the Real-World Dataset
(Chart: compares Sampling_2%, Sampling_20%, the (equi-depth) MD-histogram, equi-depth binning, unbiased V-optimized binning, and weighted V-optimized binning.)
Outline
• Data Management Support
– Supporting a Light-Weight Data Management Layer
Over HDF5
– SAGA: Array Storage as a DB with Support for
Structural Aggregations
– Approximate Aggregations Using Novel Bitmap
Indices
• Data Processing Support
– SciMATE: A Novel MapReduce-Like Framework for
Multiple Scientific Data Formats
• Future Work
Scientific Data Analysis Today
• “Store-First-Analyze-After”
– Reload data into another file system
• E.g., load data from PVFS to HDFS
– Reload data into another data format
• E.g., load NetCDF/HDF5 data to a specialized format
• Problems
– Long data migration/transformation time
– Stresses network and disks
System Overview
• Key Feature
– scientific data processing module
Scientific Data Processing Module
Parallel Data Processing Times on 16 GB Datasets
(Charts: average data processing times in seconds for KNN and K-Means on NetCDF, HDF5, and FLAT data, varying the number of threads from 1 to 8 and the number of nodes from 1 to 16.)
Future Work Outline
• Data Management Support
– SciSD: Novel Subgroup Discovery over
Scientific Datasets Using Bitmap Indices
– SciCSM: Novel Contrast Set Mining over
Scientific Datasets Using Bitmap Indices
• Data Processing Support
– StreamingMATE: A Novel MapReduce-Like
Framework Over Scientific Data Stream
SciSD
• Subgroup Discovery
– Goal: identify all the subsets that are significantly
different from the entire dataset/general population,
w.r.t. a target variable
– Can be widely used in scientific knowledge discovery
• Novelty
– Subsets can involve dimensional and/or value ranges
– All numeric attributes
– High efficiency by frequent bitmap-based approximate
aggregations
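An illustrative sketch of the subgroup-discovery objective: score a candidate subset by how far its mean of the target variable deviates from the global mean, weighted by subset size. The quality function, data, and candidate subsets are assumptions; in SciSD the per-subset statistics would come from bitmap-based approximate aggregations rather than exact NumPy reductions.

```python
import numpy as np

def subgroup_quality(target, mask):
    """Mean-shift quality: sqrt(n) * |mean(subset) - mean(all)| (assumed measure)."""
    n = int(mask.sum())
    if n == 0:
        return 0.0
    return np.sqrt(n) * abs(target[mask].mean() - target.mean())

# Hypothetical 2-D array serving as the target variable.
rng = np.random.default_rng(1)
temperature = rng.normal(15.0, 3.0, size=(100, 100))
temperature[:20, :20] += 5.0                      # planted subgroup

# Candidate subsets defined by dimensional ranges (value ranges work the same
# way once bitmap indices are available).
rows, cols = np.indices(temperature.shape)
candidates = {
    "rows<20 & cols<20": (rows < 20) & (cols < 20),
    "rows>=50": rows >= 50,
}
for name, mask in candidates.items():
    print(name, subgroup_quality(temperature.ravel(), mask.ravel()))
```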
Running Example
SciCSM
• “Sometimes it’s good to contrast what you like
with something else. It makes you appreciate
it even more.” - Darby Conley, Get Fuzzy, 2001
• Contrast Set Mining
– Goal: identify all the filters that can generate
significantly different subsets
– Common filters: time periods, spatial areas, etc.
– Usage: classifier design, change detection, disaster
prediction, etc.
Running Example
StreamingMATE
• Extend the precursor system SciMATE to process scientific data streams
• Generalized Reduction
– Reduce the data stream to a reduction object
– No shuffling or sorting
• Focus on load balancing issues
– Input data volume can be highly variable
– Topology updates: add/remove/update streaming operators
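A minimal sketch of the generalized-reduction pattern that SciMATE and StreamingMATE are built around: every incoming element is folded directly into a small reduction object, with no shuffling or sorting, and per-operator objects are merged. The per-key running mean and the class/method names are assumptions, not the framework's API.

```python
from collections import defaultdict

class ReductionObject:
    """Accumulates (count, sum) per key; merging two objects is associative."""
    def __init__(self):
        self.state = defaultdict(lambda: [0, 0.0])

    def accumulate(self, key, value):            # local reduction of one element
        entry = self.state[key]
        entry[0] += 1
        entry[1] += value

    def merge(self, other):                      # combine per-operator objects
        for key, (cnt, tot) in other.state.items():
            entry = self.state[key]
            entry[0] += cnt
            entry[1] += tot

    def result(self):
        return {k: tot / cnt for k, (cnt, tot) in self.state.items()}

# Two "streaming operators" reduce disjoint chunks, then their objects merge.
chunk_a = [("sensor1", 2.0), ("sensor2", 4.0), ("sensor1", 6.0)]
chunk_b = [("sensor2", 8.0), ("sensor1", 10.0)]
ra, rb = ReductionObject(), ReductionObject()
for k, v in chunk_a:
    ra.accumulate(k, v)
for k, v in chunk_b:
    rb.accumulate(k, v)
ra.merge(rb)
print(ra.result())   # {'sensor1': 6.0, 'sensor2': 6.0}
```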
StreamingMATE Overview
Hyperslab Selector
(Figure: hyperslab selector decision – false: nullify the condition list; true: nullify the elementary condition; fill up all the index boundary values.)
• 4-dim Salinity Dataset
– dim1: time [0, 1023]
– dim2: cols [0, 166]
– dim3: rows [0, 62]
– dim4: layers [0, 33]
Type 2 and Type 3 Query Examples
Aggregation Query Examples
• AG1: Simple global aggregation
• AG2: GROUP BY clause + HAVING clause
• AG3: GROUP BY clause
Sequential and Parallel Performance of Aggregation Queries
(Chart: execution times in seconds for OPeNDAP and for 2, 4, 8, and 16 nodes on aggregation query types AG1, AG2, and AG3.)
Array Databases
• Examples: SciDB, RasDaMan and MonetDB
• Treat Arrays as First-Class Citizens
– Everything is defined in the array dialect
• Lightweight or No ACID Maintenance
– No write conflicts: the ACID properties hold inherently
• Other Desired Functionality
– Structural aggregations, array join, provenance…
Structural Aggregations
• Aggregate the elements based on positional relationships
– E.g., a moving average that calculates the average of each 2 × 2 square from left to right:

Input Array:
1 2 3 4
5 6 7 8

Aggregation Result:
3.5 4.5 5.5

(The elements in the same 2 × 2 square are aggregated at a time.)
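A minimal sketch of this moving-average structural aggregation: average every 2 × 2 square of the input array, sliding from left to right. The implementation is illustrative, not SAGA's.

```python
import numpy as np

def moving_average_2x2(array):
    rows, cols = array.shape
    out = np.empty((rows - 1, cols - 1))
    for i in range(rows - 1):
        for j in range(cols - 1):
            out[i, j] = array[i:i + 2, j:j + 2].mean()
    return out

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]], dtype=float)
print(moving_average_2x2(arr))   # [[3.5 4.5 5.5]]
```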
Coarse-Grained Partitioning
• Pros
– Low I/O cost
– Low communication cost
• Cons
– Workload imbalance for skewed data
Fine-Grained Partitioning
• Pros
– Excellent workload balance for skewed data
• Cons
– Relatively high I/O cost
– High communication cost
Hybrid Partitioning
• Pros
– Low communication cost
– Good workload balance for skewed data
• Cons
– High I/O cost
Auto-Grained Partitioning
• 2 Steps
– Estimate the grid density (after filtering) by sampling, and thus estimate the computation cost (based on the time complexity)
• For each grid, total processing cost = constant loading cost + varying computation cost
– Partition the resulting cost array – balanced contiguous multi-way partitioning
• Dynamic programming (small # of grids)
• Greedy (large # of grids)
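A hedged sketch of balanced contiguous multi-way partitioning in its greedy form: split a per-grid cost array into contiguous pieces while keeping each piece's total near the average. This is an illustrative heuristic, not necessarily the dissertation's exact algorithm.

```python
def greedy_contiguous_partition(costs, num_parts):
    target = sum(costs) / num_parts           # ideal per-partition load
    partitions, current, remaining = [], [], num_parts
    for i, c in enumerate(costs):
        current.append(c)
        grids_left = len(costs) - i - 1
        must_close = grids_left == remaining - 1           # just enough grids left
        want_close = sum(current) >= target and grids_left >= remaining - 1
        if remaining > 1 and (must_close or want_close):
            partitions.append(current)
            current, remaining = [], remaining - 1
    partitions.append(current)
    return partitions

# Per-grid cost = constant loading cost + estimated (varying) computation cost.
costs = [4, 1, 1, 9, 2, 2, 3, 8]
for part in greedy_contiguous_partition(costs, num_parts=3):
    print(part, sum(part))
```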
Auto-Grained Partitioning (Cont’d)
• Pros
– Low I/O cost
– Low communication cost
– Great workload balance for skewed data
• Cons
– Overhead of sampling and runtime partitioning
Partitioning Strategy Summary

Strategy         I/O Performance   Workload Balance   Scalability   Additional Cost
Coarse-Grained   Excellent         Poor               Excellent     None
Fine-Grained     Poor              Excellent          Poor          None
Hybrid           Poor              Good               Good          None
Auto-Grained     Great             Great              Great         Nontrivial

Our partitioning strategy decider can help choose the best strategy.
All-Reuse Approach (Cont’d)
• Key Insight
– # of aggregates ≤ # of queried elements
– More computationally efficient to iterate over
elements and update the associated aggregates
• More Benefits
– Load balance (for hierarchical/circular
aggregations)
– More speedup for compound array elements
• The data type of an aggregate is usually primitive, but
this is not always true for an array element
Parallel Grid Aggregation Performance
• Used 4 processors on a real-life dataset of 8 GB
• User-Defined Aggregation: K-Means
– Vary the number of iterations to vary the computation amount
(Chart: aggregation times in seconds for coarse-grained, fine-grained, hybrid, and auto-grained partitioning with 2, 4, 6, 8, 10, and 12 iterations.)
Data Access Strategies and Patterns
• Full Read: probably too expensive for reading
a small data subset
• Partial Read
– Strided pattern
– Column pattern
– Discrete point pattern
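A brief sketch of these access patterns over an HDF5 dataset with h5py; the file and dataset names are hypothetical.

```python
import h5py
import numpy as np

with h5py.File("climate.h5", "r") as f:            # hypothetical file
    dset = f["/temperature"]                        # hypothetical 2-D dataset

    full = dset[...]                                # full read

    strided = dset[::4, :]                          # strided pattern: every 4th row
    column = dset[:, 7]                             # column pattern: one column
    # Discrete point pattern: a list of scattered (row, col) coordinates.
    points = [(0, 0), (5, 3), (9, 12)]
    discrete = np.array([dset[r, c] for r, c in points])

print(full.shape, strided.shape, column.shape, discrete.shape)
```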
Indexing Cost of Different Binning Strategies with Varying # of Bins on the Synthetic Dataset
(Chart: indexing times in seconds for equi-width, equi-depth, and V-optimized binning with 50, 100, 200, 400, and 800 bins.)
SUM Aggregation of Equi-Width Binning with Varying # of Bins on the Synthetic Dataset
(Chart: average relative error with 50, 100, 200, 400, and 800 bins at query coverages from 0.005% to 2%.)
SUM Aggregation of Equi-Depth Binning with Varying # of Bins on the Synthetic Dataset
(Chart: average relative error with 50, 100, 200, 400, and 800 bins at query coverages from 0.005% to 2%.)
SUM Aggregation of V-Optimized Binning with Varying # of Bins on the Synthetic Dataset
(Chart: average relative error with 50, 100, 200, 400, and 800 bins at query coverages from 0.005% to 2%.)
Average Relative Error (%) of MAX Aggregation of Different Methods on the Real-World Dataset
SUM Aggregation Times of Different Methods on the Real-World Dataset (DB)
(Chart: aggregation times in seconds for Sampling_2%, Sampling_20%, equi-depth binning, V-optimized binning, and precise aggregation at query coverages from 10% to 90%.)
SUM Aggregation Times of Different Methods on the Real-World Dataset (VBD)
(Chart: aggregation times in seconds for Sampling_2%, Sampling_20%, equi-depth binning, V-optimized binning, and precise aggregation at query coverages from 10% to 90%.)
SUM Aggregation Times of Different Methods on the Real-World Dataset (VBS)
(Chart: aggregation times in seconds for Sampling_2%, Sampling_20%, equi-depth binning, V-optimized binning, and precise aggregation at query coverages from 0.01% to 5%.)
SUM Aggregation Times of Different Methods on the Real-World Dataset (CD)
(Chart: aggregation times in seconds for Sampling_2%, Sampling_20%, equi-depth binning, unbiased V-optimized binning, weighted V-optimized binning, and precise aggregation at query coverages from 10% to 90%.)
SUM Aggregation Times of Different Methods on the Real-World Dataset (CS)
(Chart: aggregation times in seconds for Sampling_2%, Sampling_20%, equi-depth binning, V-optimized binning, and precise aggregation at query coverages from 0.005% to 2%.)
SD vs. Classification